ipa vrp implementation in gcc
Hi All,

I am looking at implementing an IPA VRP pass. Jan Hubicka also talked about this at the 2013 GNU Cauldron as one of the optimizations he would like to see in gcc. So my question is: is anyone implementing it? If not, we would like to do that.

I also looked at the ipa-cp implementation to see how this can be done. Going by this, one way to implement it is (skipping all the details):

- Have an early tree-vrp so that we can have value ranges for parameters at call sites.
- Create jump functions that capture the value ranges at call sites and propagate the value ranges. In his 2013 talk, Jan Hubicka talks about modifying ipa-prop.[h|c] to handle this, but wouldn't it be easier to have a separate and much simpler implementation?
- Once we have the value ranges for parameter/return values, we could rely on tree-vrp to use them and do the optimizations.

Does this make any sense? Any thoughts/suggestions on working on this are highly appreciated.

Thanks,
Kugan
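[To make the second bullet concrete, here is a minimal, self-contained sketch, with all names hypothetical and not GCC code, of the kind of jump function and meet operation such a pass would need: each call site records the argument's range, and the callee parameter's range is the union of what flows in from all callers.]

#include <algorithm>

/* Hypothetical value-range jump function recorded at a call site.  */
struct vr_jump_func
{
  bool known;      /* false => VARYING, no usable range.  */
  long min, max;   /* Valid only when known is true.  */
};

/* Meet two incoming ranges at a callee parameter: the result must be
   conservatively correct for both callers, so take the union.  */
static vr_jump_func
vr_meet (const vr_jump_func &a, const vr_jump_func &b)
{
  if (!a.known || !b.known)
    return vr_jump_func{false, 0, 0};
  return vr_jump_func{true, std::min (a.min, b.min), std::max (a.max, b.max)};
}

[With something like this, the propagation stage is just iterating vr_meet over all call edges into each function until the lattice stops changing.]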
Re: ipa vrp implementation in gcc
> Hello, I am Vivek Pandya. I am actually working on a GSoC 2016 proposal
> for this work and it is very similar to extending the ipa-cp pass. I am
> also in touch with Jan Hubicka.

Hi Vivek,

Glad to know that you are planning to work on this. Could you please put your plan in an accessible place (or post it here) so that we know what your plans are? That way we can work on what you are not working on, and also possibly contribute to your plan in other ways (like testing and reviewing).

Thanks,
Kugan
Re: ipa vrp implementation in gcc
Hi,

> Another potential use of value ranges is the profile estimation.
> http://www.lighterra.com/papers/valuerangeprop/Patterson1995-ValueRangeProp.pdf
> It seems to me that we may want to have something that can feed sane loop
> bounds for profile estimation as well, and we can easily store the known
> value ranges in SSA name annotations.
> So I think a separate local pass to compute value ranges (perhaps with less
> accuracy than full blown VRP) is desirable.

Thanks for the reference. I am looking at implementing a local pass for VRP. The value range computation in tree-vrp is based on the above reference and uses ASSERT_EXPR insertion (I understand that you posted the reference above for profile estimation). As Richard mentioned in his reply, the local pass should not rely on ASSERT_EXPR insertion. Therefore, do you have any specific algorithm in mind (i.e. any published paper or reference from a book)? Of course we can tweak the algorithm from the reference above, but I would like to understand what your intentions are.

> I think ipa-prop.c probably won't need any significant changes. The
> code basically analyzes what values are passed through the function and
> this works for constants as well as for intervals. In fact ipa-cp already
> uses the same ipa-prop analysis for
> 1) constant propagation
> 2) alignment propagation
> 3) propagation of known polymorphic call contexts.
>
> So replacing 1) by value range propagation should be easily doable.
> I would also like to replace alignment propagation by bitwise constant
> propagation (i.e. propagating which bits are known to be zero and which
> bits are known to be one). We already do have bitwise CCP, so we could
> deal with this basically in the same way as we deal with value ranges.
>
> ipa-prop could use a bit of cleaning up and modularizing that I hope will
> be done next stage1 :)

We (Prathamesh and myself) are interested in working on LTO improvements. Let us have a look at this.

>>> - Once we have the value ranges for parameter/return values, we could
>>> rely on tree-vrp to use this and do the optimizations
>>
>> Yep. IPA transform phase should annotate parameter default defs with
>> computed ranges.
>
> Yep, in addition we will end up with known value ranges stored in aggregates,
> for which we need a better separate representation.
>
> See https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68930

Thanks,
Kugan
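[As an illustration of the bitwise lattice mentioned above, a sketch with made-up names, not the ipa-prop code: per parameter we would track which bits all callers agree are zero and which are one, and the meet keeps only the agreed bits.]

#include <cstdint>

struct bit_lattice
{
  uint64_t known_zero;  /* Bit set => bit is known to be 0.  */
  uint64_t known_one;   /* Bit set => bit is known to be 1.  */
};

/* Meet at a callee parameter: a bit stays known only if every caller
   agrees on it.  */
static bit_lattice
bit_meet (bit_lattice a, bit_lattice b)
{
  return bit_lattice{a.known_zero & b.known_zero,
		     a.known_one & b.known_one};
}

[Alignment propagation falls out as a special case: an alignment of 8 with misalignment 0 is just the three low bits known to be zero.]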
Re: ipa vrp implementation in gcc
On 19/01/16 04:10, Jan Hubicka wrote:
> In general, given that we have an existing VRP implementation, I would
> suggest first implementing the IPA propagation and profile estimation
> bits using the existing VRP pass, and then trying to compare the simple
> dominator based approach with the VRP we have, to see what the compile
> time/code quality effects of both are. Based on that we can decide how
> complex a VRP we really want. It will probably also be more fun to
> implement it this way :) I plan to collect some data on early VRP and
> firefox today or tomorrow.

Thanks. I started experimenting with it. A prototype patch is attached. I haven't tested it in any detailed way yet; this is just to understand the LTO side and see how we can implement it. I wanted to set the value range of a parameter based on the ipa-vrp propagation. For example:

extern void foo (int);

void bar (unsigned long l)
{
  foo (l == 0);
}

void bar2 (unsigned long l)
{
  foo (l & 0x2);
}

unsigned long x;

int main ()
{
  x = 0;
  bar (x);
  x = 1;
  bar (x);
  x = 3;
  bar2 (x);
  x = 5;
  bar2 (x);
}

In the above case, I wanted the value range of the ssa_name for foo's parameter to come out as [0, 2]. As can be seen from the ipa-cp dump (attached), this is now happening. Any comments?

I also have some questions:
1. I think even if we are not going to use tree-vrp for intra-procedural value range propagation, we can factor out some of its routines and share them. Any thoughts on this?
2. Is the DOM based intra-procedural prototype Richard Biener implemented available anywhere? Can you please point me to it?

Thanks,
Kugan

IPA structures before propagation:

Function parameters:
 function foo/6 parameter descriptors:
  param #0 used undescribed_use
 function main/3 parameter descriptors:
 function bar2/1 parameter descriptors:
  param #0 used undescribed_use
 function bar/0 parameter descriptors:
  param #0 used undescribed_use

Jump functions:
 Jump functions of caller __builtin_puts/7:
 Jump functions of caller foo/6:
  callsite foo/6 -> __builtin_puts/7 :
   param 0: CONST: &"test"[0]
     Alignment: 1, misalignment: 0
 Jump functions of caller main/3:
  callsite main/3 -> foo/6 :
   param 0: CONST: 0
     Unknown alignment
  callsite main/3 -> foo/6 :
   param 0: CONST: 2
     Unknown alignment
  callsite main/3 -> foo/6 :
   param 0: CONST: 0
     Unknown alignment
  callsite main/3 -> foo/6 :
   param 0: CONST: 1
     Unknown alignment
 Jump functions of caller bar2/1:
  callsite bar2/1 -> foo/6 :
   param 0: UNKNOWN
     Unknown alignment
 Jump functions of caller bar/0:
  callsite bar/0 -> foo/6 :
   param 0: UNKNOWN
     Unknown alignment

Propagating constants:

Not considering foo for cloning; -fipa-cp-clone disabled.
Marking all lattices of foo/6 as BOTTOM
Not considering main for cloning; -fipa-cp-clone disabled.
Marking all lattices of main/3 as BOTTOM
Not considering bar2 for cloning; -fipa-cp-clone disabled.
Marking all lattices of bar2/1 as BOTTOM
Not considering bar for cloning; -fipa-cp-clone disabled.
Marking all lattices of bar/0 as BOTTOM

overall_size: 34, max_new_size: 11001
Estimating effects for bar2/1, base_time: 14.
Estimating effects for bar/0, base_time: 14.
Meeting [0, 2] and [0, 1] to [0, 2]
Estimating effects for foo/6, base_time: 6.

IPA lattices after all propagation:

Lattices:
 Node: foo/6:
  param [0]: BOTTOM
       ctxs: BOTTOM
       Alignment unusable (BOTTOM)
       [0, 2]
       AGGS BOTTOM
 Node: main/3:
 Node: bar2/1:
  param [0]: BOTTOM
       ctxs: BOTTOM
       Alignment unusable (BOTTOM)
       UNDEFINED
       AGGS BOTTOM
 Node: bar/0:
  param [0]: BOTTOM
       ctxs: BOTTOM
       Alignment unusable (BOTTOM)
       UNDEFINED
       AGGS BOTTOM

IPA decision stage:

 Evaluating opportunities for bar2/1.
 Evaluating opportunities for bar/0.
 Evaluating opportunities for foo/6.

IPA constant propagation end
Reclaiming functions:
Reclaiming variables:
Clearing address taken flags:

Symbol table:

puts/7 (__builtin_puts) @0x7ffa9ff50730
  Type: function
  Visibility: external public
  References:
  Referring:
  Availability: not_available
  First run: 0
  Function flags:
  Called by: foo/6 (0.19 per call)
  Calls:
foo/6 (foo) @0x7ffa9ff505c0
  Type: function definition analyzed
  Visibility: externally_visible public
  References:
  Referring:
  Read from file: t1.o
  Availability: available
  First run: 0
  Function flags:
  Called by: bar/0 (1.00 per call) bar2/1 (1.00 per call) main/3 (1.00 per call) main/3 (1.00 per call) main/3 (1.00 per call) main/3 (1.00 per call)
  Calls: puts/7 (0.19 per call)
x/2 (x) @0x7ffa9ff51000
  Type: variable definition analyzed
  Visibility: externally_visible public common
  References:
  Referring: main/3 (write) main/3 (write) main/3 (write) main/3 (write)
  Read from file: t2.o
  Availability: overwritable
Re: ipa vrp implementation in gcc
On 18/01/16 20:42, Richard Biener wrote:
> I have (very incomplete) prototype patches to do a dominator-based
> approach instead (what is referred to downthread as the non-iterating
> approach). That's cheaper and is what I'd like to provide as an
> "utility style" interface to things like niter analysis which need
> range-info based on a specific dominator (the loop header for example).

I am not sure if this is still of interest for GSoC. In the meantime, I was looking at intra-procedural early VRP as suggested. If I understand this correctly, we have to traverse the dominator tree, forming subregions (or scopes) in which a variable will have a certain range. We would have to record the ranges discovered in each subregion (scope) context and use them to derive more ranges (for any operation whose operands have known ranges). We will have to keep the contexts on a stack. We also have to handle loop index variables. For example:

void bar1 (int, int);
void bar2 (int, int);
void bar3 (int, int);
void bar4 (int, int);

void foo (int a, int b)
{
  int t = 0;
  /* region 1 */
  if (a < 10)
    {
      /* region 2 */
      if (b > 10)
	{
	  /* region 3 */
	  bar1 (a, b);
	}
      else
	{
	  /* region 4 */
	  bar2 (a, b);
	}
    }
  else
    {
      /* region 5 */
      bar3 (a, b);
    }
  bar4 (a, b);
}

I am also wondering whether we should split the live ranges to get better value ranges (for the example shown above)?

Thanks,
Kugan
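[To make the scoped bookkeeping concrete, here is a small self-contained toy with hypothetical names, not GCC code: ranges derived from a condition are pushed when the dominator walk enters the dominated region and popped when it leaves, so region 3 sees both a < 10 and b > 10.]

#include <climits>
#include <cstdio>
#include <map>
#include <string>
#include <vector>

struct range { long min, max; };

struct range_stack
{
  std::map<std::string, range> current;
  std::vector<std::pair<std::string, range> > saved;

  size_t mark () const { return saved.size (); }

  /* Refine N's range for the region we are entering, remembering the
     old range so it can be restored on the way out.  */
  void push (const std::string &n, range r)
  {
    std::map<std::string, range>::iterator it = current.find (n);
    range old = (it == current.end ()) ? range{LONG_MIN, LONG_MAX} : it->second;
    saved.push_back (std::make_pair (n, old));
    current[n] = r;
  }

  /* Restore everything pushed since MARK, i.e. leave the region.  */
  void pop_to (size_t m)
  {
    while (saved.size () > m)
      {
	current[saved.back ().first] = saved.back ().second;
	saved.pop_back ();
      }
  }
};

int main ()
{
  range_stack rs;
  size_t r1 = rs.mark ();
  rs.push ("a", range{LONG_MIN, 9});	/* entering region 2: a < 10 */
  size_t r2 = rs.mark ();
  rs.push ("b", range{11, LONG_MAX});	/* entering region 3: b > 10 */
  std::printf ("region 3: a <= %ld, b >= %ld\n",
	       rs.current["a"].max, rs.current["b"].min);
  rs.pop_to (r2);			/* leaving region 3 */
  rs.pop_to (r1);			/* leaving region 2 */
  return 0;
}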
fstrict-enums and value ranges in VRP
Hi All,

When I compile the following code with g++ using -fstrict-enums and -O2:

enum v
{
  OK = 0,
  NOK = 1,
};

int foo0 (enum v a)
{
  if (a > NOK)
    return 0;
  return 1;
}

the vrp1 dump looks like:

Value ranges after VRP:

a.0_1: VARYING
_2: [0, 1]
a_3(D): VARYING

int foo0(v) (v a)
{
  int a.0_1;
  int _2;

  <bb 2>:
  a.0_1 = (int) a_3(D);
  if (a.0_1 > 1)
    goto <bb 4>;
  else
    goto <bb 3>;

  <bb 3>:

  <bb 4>:
  # _2 = PHI <0(2), 1(3)>
  return _2;

}

Should we infer value ranges for the enum, since this is -fstrict-enums, and optimize it?

@item -fstrict-enums
@opindex fstrict-enums
Allow the compiler to optimize using the assumption that a value of
enumerated type can only be one of the values of the enumeration (as
defined in the C++ standard; basically, a value that can be represented
in the minimum number of bits needed to represent all the enumerators).
This assumption may not be valid if the program uses a cast to convert
an arbitrary integer value to the enumerated type.

Thanks,
Kugan
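[For what it's worth, here is the result I would expect if VRP used the -fstrict-enums guarantee; a sketch of the desired folding, not current behavior: with a_3(D) restricted to [0, 1], the test a.0_1 > 1 is always false and the function folds to:]

int foo0 (enum v a) { return 1; }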
Re: anti-ranges of signed variables
Hi,

On 12/11/16 06:19, Jakub Jelinek wrote:
> On Fri, Nov 11, 2016 at 11:51:34AM -0700, Martin Sebor wrote:
>> On 11/11/2016 10:53 AM, Richard Biener wrote:
>>> On November 11, 2016 6:34:37 PM GMT+01:00, Martin Sebor wrote:
>>>> I noticed that variables of signed integer types that are
>>>> constrained to a specific subrange of values of the type like so:
>>>>   [-TYPE_MAX + N, N]
>>>> are reported by get_range_info as the anti-range
>>>>   [-TYPE_MAX, TYPE_MIN - 1]
>>>> for all positive N of the type, regardless of the variable's actual
>>>> range. Basically, such variables are treated the same as variables
>>>> of the same type that have no range info associated with them at all
>>>> (such as function arguments or global variables). For example, a
>>>> signed char variable between -1 and 126 is represented by
>>>> VR_ANTI_RANGE [127, -2]
>>>
>>> ? I'd expect [-1, 126]. And certainly never range-min > range-max
>>
>> Okay. With this code:
>>
>> void f (void *d, const void *s, signed char i)
>> {
>>   if (i < -1 || 126 < i)
>>     i = -1;
>>   __builtin_memcpy (d, s, i);
>> }
>>
>> I see the following in the output of -fdump-tree-vrp:
>>
>> prephitmp_11: ~[127, 18446744073709551614]
>> ...
>> # prephitmp_11 = PHI <_12(3), 18446744073709551615(2)>
>> __builtin_memcpy (d_8(D), s_9(D), prephitmp_11);
>
> At some point get_range_info for anti-ranges has been represented by
> using min larger than max, but later on some extra bit on SSA_NAME has
> been added. Dunno if the code has been adjusted at that point.

The commit that changed that and removed it is:

commit 0c20fe492bc5b8c9259d21dd2dab03ff5155facb
Author: rsandifo
Date:   Thu Nov 28 16:32:44 2013 +

    wide-int version of SSA_NAME_ANTI_ALIAS_P patch.

    git-svn-id: svn+ssh://gcc.gnu.org/svn/gcc/branches/wide-int@205491 138bc75d-0d04-0410-961f-82ee72b054a4

But looking closely:

enum value_range_type { VR_UNDEFINED, VR_RANGE, VR_ANTI_RANGE, VR_VARYING, VR_LAST };

in set_range_info, we have:

  SSA_NAME_ANTI_RANGE_P (name) = (range_type == VR_ANTI_RANGE);

in get_range_info, we have:

  return SSA_NAME_RANGE_TYPE (name);

I think we should change get_range_info to:

diff --git a/gcc/tree-ssanames.c b/gcc/tree-ssanames.c
index 913d142..f33b9c0 100644
--- a/gcc/tree-ssanames.c
+++ b/gcc/tree-ssanames.c
@@ -371,7 +371,7 @@ get_range_info (const_tree name, wide_int *min, wide_int *max)
   *min = ri->get_min ();
   *max = ri->get_max ();
-  return SSA_NAME_RANGE_TYPE (name);
+  return SSA_NAME_RANGE_TYPE (name) ? VR_ANTI_RANGE : VR_RANGE;
 }

Is this OK after testing?

Thanks,
Kugan
Re: anti-ranges of signed variables
> I think we should change get_range_info to:
>
> diff --git a/gcc/tree-ssanames.c b/gcc/tree-ssanames.c
> index 913d142..f33b9c0 100644
> --- a/gcc/tree-ssanames.c
> +++ b/gcc/tree-ssanames.c
> @@ -371,7 +371,7 @@ get_range_info (const_tree name, wide_int *min, wide_int *max)
>    *min = ri->get_min ();
>    *max = ri->get_max ();
> -  return SSA_NAME_RANGE_TYPE (name);
> +  return SSA_NAME_RANGE_TYPE (name) ? VR_ANTI_RANGE : VR_RANGE;
>  }

OK, this is what SSA_NAME_RANGE_TYPE in tree.h is already doing:

#define SSA_NAME_RANGE_TYPE(N) \
  (SSA_NAME_ANTI_RANGE_P (N) ? VR_ANTI_RANGE : VR_RANGE)

So we shouldn't do it again. Sorry about the noise.

Kugan
Using particular register class (like floating point registers) as spill register class
I would like to know if there is any way we can use registers from a particular register class just as spill registers (in places where the register allocator would normally spill to the stack, and nothing more), when it can be useful.

In AArch64, in some cases, compiling with -mgeneral-regs-only produces better performance compared to not using it. The difference here is that when -mgeneral-regs-only is not used, floating point registers are also used in register allocation. IRA/LRA then has to move them to core registers before performing operations, as shown below.

...
	fmov	s1, w8		<--
	mov	w21, 49622
	movk	w21, 0xca62, lsl 16
	add	w21, w16, w21
	add	w21, w21, w2
	eor	w10, w0, w10
	add	w10, w21, w10
	ror	w8, w7, 27
	add	w7, w10, w8
	ror	w7, w7, 27
	fmov	w0, s1		<--
	add	w7, w0, w7
	add	w13, w13, w7
	fmov	w0, s4		<--
	add	w0, w0, w20
	fmov	s4, w0		<--
	ror	w18, w18, 2
	fmov	w0, s2		<--
	add	w0, w0, w18
	fmov	s2, w0		<--
	add	w12, w12, w27
	add	w14, w14, w15
	mov	w15, w24
	fmov	x0, d3		<--
	subs	x0, x0, #1
	fmov	d3, x0		<--
	bne	.L2
	fmov	x0, d0		<--
...

In this case, the costs for allocnos calculated by IRA based on the cost model supplied by the back-end look like:

  a0(r667,l0) costs: GENERAL_REGS:0,0 FP_LO_REGS:3960,3960 FP_REGS:3960,3960 ALL_REGS:3960,3960 MEM:3960,3960

Thus, changing the cost of the floating point register class is not going to help. If I increase it further, the register allocator will just spill these live ranges to memory and will ignore floating point registers in this case. Is there any other back-end in gcc that does anything to improve cases like this, that I can refer to?

Thanks in advance,
Kugan
Re: Using particular register class (like floating point registers) as spill register class
On 16/05/14 20:40, pins...@gmail.com wrote:
>
>> On May 16, 2014, at 3:23 AM, Kugan wrote:
>>
>> I would like to know if there is any way we can use registers from a
>> particular register class just as spill registers (in places where the
>> register allocator would normally spill to the stack and nothing more),
>> when it can be useful.
>>
>> In AArch64, in some cases, compiling with -mgeneral-regs-only produces
>> better performance compared to not using it. The difference here is that
>> when -mgeneral-regs-only is not used, floating point registers are also
>> used in register allocation. IRA/LRA then has to move them to core
>> registers before performing operations.
>
> Can you show the code with fp registers disabled? Does it use the stack
> to spill? Normally this is due to register to register class costs
> compared to register to memory move cost. Also I think it depends on the
> processor rather than the target. For Thunder, using the fp registers
> might actually be better than using the stack, depending on whether the
> stack was in L1.

Not all the LDR/STR combinations match to fmov. In the testcase I have:

aarch64-none-linux-gnu-gcc sha_dgst.c -O2 -S -mgeneral-regs-only
grep -c "ldr" sha_dgst.s
50
grep -c "str" sha_dgst.s
42
grep -c "fmov" sha_dgst.s
0

aarch64-none-linux-gnu-gcc sha_dgst.c -O2 -S
grep -c "ldr" sha_dgst.s
42
grep -c "str" sha_dgst.s
31
grep -c "fmov" sha_dgst.s
105

I am not saying that we shouldn't use floating point registers here. But from the above, it seems like the register allocator is using them more like core registers (even though the cost model gives them a higher cost) and then moving the values to core registers before operations. If that is the case, my question is: how do we make this a spill register class, so that we replace ldr/str with an equal number of fmov instructions when possible?

Thanks,
Kugan
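[One existing mechanism that looks relevant here is LRA's TARGET_SPILL_CLASS hook, which lets a back-end offer a register class as a spill alternative before falling back to memory (i386 uses it to spill integers to SSE registers). A hypothetical, untested AArch64 sketch:]

/* Sketch only, modeled on ix86_spill_class; not a tested implementation.
   Offer FP registers for spilling general-register integer values
   instead of going straight to the stack.  */
static reg_class_t
aarch64_spill_class (reg_class_t rclass, machine_mode mode)
{
  if (reg_class_subset_p (rclass, GENERAL_REGS)
      && (mode == SImode || mode == DImode))
    return FP_REGS;
  return NO_REGS;
}

#undef TARGET_SPILL_CLASS
#define TARGET_SPILL_CLASS aarch64_spill_class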
Zero/Sign extension elimination using value ranges
This is based on my earlier patch https://gcc.gnu.org/ml/gcc-patches/2013-10/msg00452.html. Before I post the new set of patches, I would like to make sure that I understood the review comments and that my idea makes sense and is acceptable. Please let me know if I am missing anything or my assumptions are wrong.

To recap the basic idea: when GIMPLE_ASSIGN stmts are expanded to RTL, if we can prove that the zero/sign extension to fit the type is redundant, we can generate RTL without it. For example, when an expression is evaluated and its value is assigned to a variable of type short, the generated RTL currently looks similar to

  (set (reg:SI 110) (zero_extend:SI (subreg:HI (reg:SI 117) 0))).

Using value ranges, if we can show that the value of the expression which is present in register 117 is within the limits of short and there is no sign conversion, we do not need to perform the zero_extend.

Cases to handle here are:

1. NOP_EXPR or CONVERT_EXPR that are in the IL because they are required for type correctness. We have two cases here:

A) Mode is smaller than word_mode. This is usually where the zero/sign extensions show up in the final assembly. For example:

  int = (int) short

usually expands to

  (set (reg:SI ...) (sext:SI (subreg:HI (reg:SI ...))))

We can expand this as

  (set (reg:SI ...) (reg:SI ...))

if the following is true:
1. The values stored in RHS and LHS are of the same signedness.
2. The type can hold the value; i.e. in cases like char = (char) short, we check that the value in short is representable in char type (i.e. look at the value range of the RHS SSA_NAME and see if it can be represented in the type of the LHS without overflowing).

The subreg here is not a paradoxical subreg. We are removing the subreg and the zero/sign extend here. I am assuming here that QI/HI registers are represented in SImode (basically word_mode), with zero/sign extend used as in (zero_extend:SI (subreg:HI (reg:SI 117))).

B) Mode is larger than word_mode:

  long = (long) int

usually expands to

  (set:DI (sext:DI (reg:SI)))

We would have to expand this as a paradoxical subreg:

  (set:DI (subreg:DI (reg:SI)))

I am not sure that these cases result in actual zero/sign extensions being generated, therefore I think we should skip this case altogether.

2. Promotions required by the target (PROMOTE_MODE) that do arithmetic on wider registers, like:

  char = char + char

In this case we will have the value ranges of the RHS char1 and char2. We will have to compute the value range of (char1 + char2) in the promoted mode (from the value ranges stored in the char1 and char2 SSA_NAMEs) and see if that value range can be represented in the LHS type. Once again, if the following is true, we can remove the subreg and zero/sign extension in the assignment:
1. The values stored in RHS and LHS are of the same signedness.
2. The type can hold the value.

Also, when the LHS is promoted and thus the target is (subreg:XX N), the RHS has been expanded in XXmode. Depending on the value range, and when mode XX is bigger than word mode, set this to a paradoxical subreg of the expanded result. However, since we are only interested in XXmode smaller than word_mode (that is where most of the final zero/sign extension asm comes from), we don't have to consider paradoxical subregs here.

Does this make sense?

Thanks,
Kugan
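[As a concrete illustration of case 2 (an example added here, not from the original mail), assuming a PROMOTE_MODE target where char arithmetic is carried out in SImode registers:]

unsigned char
f (unsigned char c1, unsigned char c2)
{
  /* c1 / 2 and c2 / 2 are each in [0, 127], so the sum, computed in a
     promoted SImode register, is in [0, 254].  That fits unsigned char,
     so the zero extension normally emitted for the result is
     redundant.  */
  return c1 / 2 + c2 / 2;
}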
Re: Zero/Sign extension elimination using value ranges
On 20/05/14 16:52, Jakub Jelinek wrote:
> On Tue, May 20, 2014 at 12:27:31PM +1000, Kugan wrote:
>> 1. Handling NOP_EXPR or CONVERT_EXPR that are in the IL because they
>> are required for type correctness. We have two cases here:
>>
>> A) Mode is smaller than word_mode. This is usually from where the
>> zero/sign extensions are showing up in final assembly.
>> For example:
>>   int = (int) short
>> which usually expands to
>>   (set (reg:SI ...) (sext:SI (subreg:HI (reg:SI ...))))
>> We can expand this
>>   (set (reg:SI ...) (reg:SI ...))
>>
>> If the following is true:
>> 1. Value stored in RHS and LHS are of the same signedness
>> 2. Type can hold the value. i.e., in cases like char = (char) short, we
>> check that the value in short is representable in char type (i.e. look at
>> the value range in the RHS SSA_NAME and see if that can be represented in
>> the type of the LHS without overflowing).
>>
>> Subreg here is not a paradoxical subreg. We are removing the subreg and
>> zero/sign extend here.
>>
>> I am assuming here that QI/HI registers are represented in SImode
>> (basically word_mode) with zero/sign extend used as in
>> (zero_extend:SI (subreg:HI (reg:SI 117))).
>
> Wouldn't it be better to just set proper flags on the SUBREG based on value
> range info (SUBREG_PROMOTED_VAR_P and SUBREG_PROMOTED_UNSIGNED_P)?
> Then not only the optimizers could eliminate the zext/sext when possible, but
> all other optimizations could benefit from that.

Thanks for the comments. Here is an attempt (attached) that sets SUBREG_PROMOTED_VAR_P based on value range info. Is this a good place to do this?

Thanks,
Kugan

diff --git a/gcc/cfgexpand.c b/gcc/cfgexpand.c
index b7f6360..d23ae76 100644
--- a/gcc/cfgexpand.c
+++ b/gcc/cfgexpand.c
@@ -3120,6 +3120,60 @@ expand_return (tree retval)
     }
 }
 
+
+static bool
+is_assign_promotion_redundant (struct separate_ops *ops)
+{
+  double_int type_min, type_max;
+  double_int min, max;
+  bool uns = TYPE_UNSIGNED (ops->type);
+  double_int msb;
+
+  /* We remove extension for integral stmts.  */
+  if (!INTEGRAL_TYPE_P (ops->type))
+    return false;
+
+  if (TREE_CODE_CLASS (ops->code) == tcc_unary)
+    {
+      switch (ops->code)
+	{
+	case CONVERT_EXPR:
+	case NOP_EXPR:
+
+	  /* Get the value range.  */
+	  if (TREE_CODE (ops->op0) != SSA_NAME
+	      || POINTER_TYPE_P (TREE_TYPE (ops->op0))
+	      || get_range_info (ops->op0, &min, &max) != VR_RANGE)
+	    return false;
+
+	  msb = double_int_one.rshift (TYPE_PRECISION (TREE_TYPE (ops->op0)));
+	  if (!uns && min.cmp (msb, uns) == 1
+	      && max.cmp (msb, uns) == 1)
+	    {
+	      min = min.sext (TYPE_PRECISION (TREE_TYPE (ops->op0)));
+	      max = max.sext (TYPE_PRECISION (TREE_TYPE (ops->op0)));
+	    }
+
+	  /* Signedness of LHS and RHS should match, or the value range of
+	     RHS should be all positive values, to make zero/sign extension
+	     redundant.  */
+	  if ((uns != TYPE_UNSIGNED (TREE_TYPE (ops->op0)))
+	      && (min.cmp (double_int_zero,
+			   TYPE_UNSIGNED (TREE_TYPE (ops->op0))) == -1))
+	    return false;
+
+	  type_max = tree_to_double_int (TYPE_MAX_VALUE (ops->type));
+	  type_min = tree_to_double_int (TYPE_MIN_VALUE (ops->type));
+
+	  /* If rhs value range fits lhs type, zero/sign extension is
+	     redundant.  */
+	  if (max.cmp (type_max, uns) != 1
+	      && (type_min.cmp (min, uns)) != 1)
+	    return true;
+	}
+    }
+
+  return false;
+}
+
 /* A subroutine of expand_gimple_stmt, expanding one gimple statement
    STMT that doesn't require special handling for outgoing edges.  That
    is no tailcalls and no GIMPLE_COND.  */
@@ -3240,6 +3294,12 @@ expand_gimple_stmt_1 (gimple stmt)
 	      }
 	    ops.location = gimple_location (stmt);
 
+	    if (promoted && is_assign_promotion_redundant (&ops))
+	      {
+		promoted = false;
+		SUBREG_PROMOTED_VAR_P (target) = 0;
+	      }
+
 	    /* If we want to use a nontemporal store, force the value to
 	       register first.  If we store into a promoted register,
 	       don't directly expand to target.  */
Re: Zero/Sign extension elimination using value ranges
On 21/05/14 17:05, Jakub Jelinek wrote:
> On Wed, May 21, 2014 at 12:53:47PM +1000, Kugan wrote:
>> On 20/05/14 16:52, Jakub Jelinek wrote:
>>> On Tue, May 20, 2014 at 12:27:31PM +1000, Kugan wrote:
>>>> 1. Handling NOP_EXPR or CONVERT_EXPR that are in the IL because they
>>>> are required for type correctness. We have two cases here:
>>>>
>>>> A) Mode is smaller than word_mode. This is usually from where the
>>>> zero/sign extensions are showing up in final assembly.
>>>> For example:
>>>>   int = (int) short
>>>> which usually expands to
>>>>   (set (reg:SI ...) (sext:SI (subreg:HI (reg:SI ...))))
>>>> We can expand this
>>>>   (set (reg:SI ...) (reg:SI ...))
>>>>
>>>> If the following is true:
>>>> 1. Value stored in RHS and LHS are of the same signedness
>>>> 2. Type can hold the value. i.e., in cases like char = (char) short, we
>>>> check that the value in short is representable in char type (i.e. look at
>>>> the value range in the RHS SSA_NAME and see if that can be represented in
>>>> the type of the LHS without overflowing).
>>>>
>>>> Subreg here is not a paradoxical subreg. We are removing the subreg and
>>>> zero/sign extend here.
>>>>
>>>> I am assuming here that QI/HI registers are represented in SImode
>>>> (basically word_mode) with zero/sign extend used as in
>>>> (zero_extend:SI (subreg:HI (reg:SI 117))).
>>>
>>> Wouldn't it be better to just set proper flags on the SUBREG based on value
>>> range info (SUBREG_PROMOTED_VAR_P and SUBREG_PROMOTED_UNSIGNED_P)?
>>> Then not only the optimizers could eliminate the zext/sext when possible, but
>>> all other optimizations could benefit from that.
>>
>> Thanks for the comments. Here is an attempt (attached) that sets
>> SUBREG_PROMOTED_VAR_P based on value range info. Is this a good place
>> to do this?
>
> But you aren't setting it in your patch in any way, you are just resetting
> it instead. The thing is, start with a testcase where you get that
> (subreg:HI (reg:SI)) as the RTL of some SSA_NAME (is that the case on ARM?,
> I believe on e.g. i?86/x86_64 you'd just get (reg:HI) instead and thus you
> can't take advantage of that), and at the point where it is created check
> the range info, and if it is properly sign or zero extended, set
> SUBREG_PROMOTED_VAR_P and SUBREG_PROMOTED_UNSIGNED_SET.

Here is another attempt (a quick hack patch is attached). Is this a reasonable direction? I think I will have to look for other places where SUBREG_PROMOTED_UNSIGNED_P is used for possible optimisations. Before that I want to make sure I am on the right track.

> Note that right now we use 2 bits for the latter, which encode values
> -1 (weirdo pointer extension), 0 (sign extension), 1 (zero extension).
> Perhaps it would be nice to allow encoding value 2 (zero and sign extension)
> for cases where the range info tells you that the value is both zero and
> sign extended (i.e. minimum of range is >= 0 and maximum is <= signed type
> maximum).

Do you suggest changing rtx_def like the following, to be able to store 2 in SUBREG_PROMOTED_UNSIGNED_SET? Probably not.
-  unsigned int unchanging : 1;
+  unsigned int unchanging : 2;

Thanks,
Kugan

diff --git a/gcc/expr.c b/gcc/expr.c
index 2868d9d..15183fa 100644
--- a/gcc/expr.c
+++ b/gcc/expr.c
@@ -328,7 +328,8 @@ convert_move (rtx to, rtx from, int unsignedp)
   if (GET_CODE (from) == SUBREG && SUBREG_PROMOTED_VAR_P (from)
       && (GET_MODE_PRECISION (GET_MODE (SUBREG_REG (from)))
	   >= GET_MODE_PRECISION (to_mode))
-      && SUBREG_PROMOTED_UNSIGNED_P (from) == unsignedp)
+      && (SUBREG_PROMOTED_UNSIGNED_P (from) == 2
+	  || SUBREG_PROMOTED_UNSIGNED_P (from) == unsignedp))
     from = gen_lowpart (to_mode, from), from_mode = to_mode;
 
   gcc_assert (GET_CODE (to) != SUBREG || !SUBREG_PROMOTED_VAR_P (to));
@@ -9195,6 +9196,51 @@ expand_expr_real_2 (sepops ops, rtx target, enum machine_mode tmode,
 }
 #undef REDUCE_BIT_FIELD
 
+static bool
+is_value_extended (tree lhs, enum machine_mode rhs_mode, bool rhs_uns)
+{
+  wide_int type_min, type_max;
+  wide_int min, max;
+  unsigned int prec;
+  tree lhs_type;
+  bool lhs_uns;
+
+  if (TREE_CODE (lhs) != SSA_NAME)
+    return false;
+
+  lhs_type = lang_hooks.types.type_for_mode (rhs_mode, rhs_uns);
+  lhs_uns = TYPE_UNSIGNED (TREE_TYPE (lhs));
+
+  /* We remove extension for integrals.  */
+  if (!INTEGRAL_TYPE_P (TREE_TYPE (lhs)))
+    return false;
+
+  /* Get the value range.  */
+  if (POINTER_TYPE_P (TREE_TYPE (lhs))
Re: Zero/Sign extension elimination using value ranges
ED: \
	 _rtx->volatil = 1; \
	 _rtx->unchanging = 1; \
	 break; \
       } \
  } while (0)

#define SUBREG_PROMOTED_GET(RTX) \
  (2 * ((RTL_FLAG_CHECK1 ("SUBREG_PROMOTED_GET", (RTX), SUBREG))->volatil) \
   + (RTX)->unchanging - 1)

#define SUBREG_PROMOTED_SIGNED_P(RTX) \
  ((((RTL_FLAG_CHECK1 ("SUBREG_PROMOTED_SIGNED_P", (RTX), SUBREG)->volatil) \
     + (RTX)->unchanging) == 0) ? 0 : ((RTX)->unchanging == 1))

#define SUBREG_PROMOTED_UNSIGNED_P(RTX) \
  ((((RTL_FLAG_CHECK1 ("SUBREG_PROMOTED_UNSIGNED_P", (RTX), SUBREG)->volatil) \
     + (RTX)->unchanging) == 0) ? -1 : ((RTX)->volatil == 1))

#define SUBREG_CHECK_PROMOTED_SIGN(RTX, SIGN) \
  ((SIGN) ? SUBREG_PROMOTED_UNSIGNED_P ((RTX)) \
	  : SUBREG_PROMOTED_SIGNED_P ((RTX)))

Does this look reasonable?

Thanks,
Kugan

diff --git a/gcc/calls.c b/gcc/calls.c
index 78fe7d8..a1e7468 100644
--- a/gcc/calls.c
+++ b/gcc/calls.c
@@ -1484,8 +1484,11 @@ precompute_arguments (int num_actuals, struct arg_data *args)
	   args[i].initial_value
	     = gen_lowpart_SUBREG (mode, args[i].value);
	   SUBREG_PROMOTED_VAR_P (args[i].initial_value) = 1;
-	  SUBREG_PROMOTED_UNSIGNED_SET (args[i].initial_value,
-					args[i].unsignedp);
+
+	  if (is_promoted_for_type (args[i].tree_value, mode, !args[i].unsignedp))
+	    SUBREG_PROMOTED_SET (args[i].initial_value, SRP_SIGNED_AND_UNSIGNED);
+	  else
+	    SUBREG_PROMOTED_SET (args[i].initial_value, args[i].unsignedp);
	 }
     }
 }
@@ -3365,7 +3368,8 @@ expand_call (tree exp, rtx target, int ignore)
	   target = gen_rtx_SUBREG (TYPE_MODE (type), target, offset);
	   SUBREG_PROMOTED_VAR_P (target) = 1;
-	  SUBREG_PROMOTED_UNSIGNED_SET (target, unsignedp);
+	  SUBREG_PROMOTED_SET (target, unsignedp);
+
	 }
 
       /* If size of args is variable or this was a constructor call for a stack
diff --git a/gcc/expr.c b/gcc/expr.c
index d99bc1e..7a1a2b9 100644
--- a/gcc/expr.c
+++ b/gcc/expr.c
@@ -328,7 +328,7 @@ convert_move (rtx to, rtx from, int unsignedp)
   if (GET_CODE (from) == SUBREG && SUBREG_PROMOTED_VAR_P (from)
       && (GET_MODE_PRECISION (GET_MODE (SUBREG_REG (from)))
	   >= GET_MODE_PRECISION (to_mode))
-      && SUBREG_PROMOTED_UNSIGNED_P (from) == unsignedp)
+      && (SUBREG_CHECK_PROMOTED_SIGN (from, unsignedp)))
     from = gen_lowpart (to_mode, from), from_mode = to_mode;
 
   gcc_assert (GET_CODE (to) != SUBREG || !SUBREG_PROMOTED_VAR_P (to));
@@ -702,7 +702,7 @@ convert_modes (enum machine_mode mode, enum machine_mode oldmode, rtx x, int unsignedp)
   if (GET_CODE (x) == SUBREG && SUBREG_PROMOTED_VAR_P (x)
       && GET_MODE_SIZE (GET_MODE (SUBREG_REG (x))) >= GET_MODE_SIZE (mode)
-      && SUBREG_PROMOTED_UNSIGNED_P (x) == unsignedp)
+      && (SUBREG_CHECK_PROMOTED_SIGN (x, unsignedp)))
     x = gen_lowpart (mode, SUBREG_REG (x));
 
   if (GET_MODE (x) != VOIDmode)
@@ -4375,6 +4375,7 @@ emit_push_insn (rtx x, enum machine_mode mode, tree type, rtx size,
     {
       /* Handle calls that pass values in multiple non-contiguous locations.
	  The Irix 6 ABI has examples of this.  */
+
       if (GET_CODE (reg) == PARALLEL)
	 emit_group_load (reg, x, type, -1);
       else
@@ -5201,8 +5202,7 @@ store_expr (tree exp, rtx target, int call_param_p, bool nontemporal)
	   && GET_MODE_PRECISION (GET_MODE (target))
	      == TYPE_PRECISION (TREE_TYPE (exp)))
	 {
-	  if (TYPE_UNSIGNED (TREE_TYPE (exp))
-	      != SUBREG_PROMOTED_UNSIGNED_P (target))
+	  if (!(SUBREG_CHECK_PROMOTED_SIGN (target, TYPE_UNSIGNED (TREE_TYPE (exp)))))
	     {
	       /* Some types, e.g. Fortran's logical*4, won't have a signed
		  version, so use the mode instead.  */
@@ -9209,6 +9209,52 @@ expand_expr_real_2 (sepops ops, rtx target, enum machine_mode tmode,
 }
 #undef REDUCE_BIT_FIELD
 
+/* Return TRUE if value in RHS is already zero/sign extended for lhs type
+   (type here is the combination of LHS_MODE and LHS_UNS) using value range
+   information stored in RHS.  Return FALSE otherwise.  */
+bool
+is_promoted_for_type (tree rhs, enum machine_mode lhs_mode, bool lhs_uns)
+{
+  wide_int type_min, type_max;
+  wide_int min, max;
+  unsigned int prec;
+  tree lhs_type;
+  bool rhs_uns;
+
+  if (flag_wrapv
+      || (rhs == NULL_TREE)
+      || (TREE_CODE (rhs) != SSA_NAME)
+      || !INTEGRAL_TYPE_P (TREE_TYPE (rhs))
+      || POINTER_TYPE_P (TREE_TYPE (rhs))
+      || (get_range_info (rhs, &min, &max) !
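[For orientation, the four promotion states this bit pair encodes correspond to the named constants that later appeared in rtl.h upstream:]

#define SRP_POINTER		(-1)	/* weirdo pointer extension */
#define SRP_SIGNED		  0	/* value is sign extended */
#define SRP_UNSIGNED		  1	/* value is zero extended */
#define SRP_SIGNED_AND_UNSIGNED	  2	/* both sign and zero extended */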
Question about LRA in aarch64_be-none-elf
Hi All,

I am looking at a regression (in aarch64_be-none-elf-gcc with -Og; test-case attached) where a TImode register is assigned two DImode values and then passed to __multf3 as an argument. When the intermediate pseudo (TImode) is assigned an FP_REG to hold this value, the regression shows up. The difference in asm between the working and non-working versions is below.

 	fmov	d1, x20
 	fmov	v1.d[1], x19
+	str	q0, [x29, 64]
+	str	x22, [x29, 64]
 	fmov	d0, x21
-	fmov	v0.d[1], x22
 	bl	__multf3

When LRA assigns one of the DImode values to the TImode register, it spills the TImode register into memory, appends the DImode value, and then reloads (as shown below in the dump). However, it is not storing the value into the right part of the TImode register, and due to that one of the moves becomes dead and is removed by dce. If I compile the test-case with -fno-dce, I get the following asm:

	fmov	d1, x3
	fmov	v1.d[1], x19
	str	q0, [x29, 64]
	str	x19, [x29, 64]
	ldr	q0, [x29, 64]
	fmov	d0, x3
	bl	__addtf3

What is causing LRA to generate moves like this?

Thanks,
Kugan

t.c.214r.reload
---------------
(insn 88 87 133 3 (clobber (reg:TI 32 v0 [orig:108 d+-8 ] [108])) t.c:11 -1
     (nil))
--- 89, 134 and 91 stores
(insn 133 88 89 3 (set (mem/c:TI (plus:DI (reg/f:DI 29 x29)
                (const_int 64 [0x40])) [0 %sfp+-16 S16 A128])
        (reg:TI 32 v0 [orig:108 d+-8 ] [108])) t.c:11 37 {*movti_aarch64}
     (nil))
(insn 89 133 134 3 (set (mem/c:DI (plus:DI (reg/f:DI 29 x29)
                (const_int 64 [0x40])) [0 %sfp+-16 S8 A128])
        (reg:DI 19 x19 [orig:102 d ] [102])) t.c:11 34 {*movdi_aarch64}
     (nil))
(insn 134 89 90 3 (set (reg:TI 32 v0 [orig:108 d+-8 ] [108])
        (mem/c:TI (plus:DI (reg/f:DI 29 x29)
                (const_int 64 [0x40])) [0 %sfp+-16 S16 A128])) t.c:11 37 {*movti_aarch64}
     (nil))
(insn 90 134 91 3 (set (reg:DI 32 v0 [orig:108 d ] [108])
        (reg:DI 20 x20 [orig:105 d+8 ] [105])) t.c:11 34 {*movdi_aarch64}
     (nil))
(insn 91 90 15 3 (set (reg:TF 32 v0)
        (reg:TF 32 v0 [orig:108 d+-8 ] [108])) t.c:11 40 {*movtf_aarch64}
     (nil))
(call_insn/u 15 91 129 3 (parallel [
            (set (reg:TF 32 v0)
                (call (mem:DI (symbol_ref:DI ("__addtf3") [flags 0x41]) [0 S8 A8])
                    (const_int 0 [0])))
            (use (const_int 0 [0]))
            (clobber (reg:DI 30 x30))
        ]) t.c:11 28 {*call_value_symbol}
     (expr_list:REG_EH_REGION (const_int -2147483648 [0x8000])
        (nil))
    (expr_list (use (reg:TF 33 v1))
        (expr_list (use (reg:TF 32 v0))
            (nil))))

t.c.228r.cprop_hardreg
----------------------
(insn 88 174 133 3 (clobber (reg:TI 32 v0 [orig:108 d+-8 ] [108])) t.c:11 -1
     (nil))
(insn 133 88 89 3 (set (mem/c:TI (plus:DI (reg/f:DI 29 x29)
                (const_int 64 [0x40])) [0 %sfp+-16 S16 A128])
        (reg:TI 32 v0 [orig:108 d+-8 ] [108])) t.c:11 37 {*movti_aarch64}
     (expr_list:REG_DEAD (reg:TI 32 v0 [orig:108 d+-8 ] [108])
        (nil)))
(insn 89 133 134 3 (set (mem/c:DI (plus:DI (reg/f:DI 29 x29)
                (const_int 64 [0x40])) [0 %sfp+-16 S8 A128])
        (reg:DI 19 x19 [orig:102 d ] [102])) t.c:11 34 {*movdi_aarch64}
     (nil))
(insn 134 89 90 3 (set (reg:TI 32 v0 [orig:108 d+-8 ] [108])
        (mem/c:TI (plus:DI (reg/f:DI 29 x29)
                (const_int 64 [0x40])) [0 %sfp+-16 S16 A128])) t.c:11 37 {*movti_aarch64}
     (expr_list:REG_UNUSED (reg:TI 32 v0 [orig:108 d+-8 ] [108])
        (nil)))
(insn 90 134 15 3 (set (reg:DI 32 v0 [orig:108 d ] [108])
        (reg:DI 3 x3 [orig:105 d+8 ] [105])) t.c:11 34 {*movdi_aarch64}
     (nil))
(call_insn/u 15 90 175 3 (parallel [
            (set (reg:TF 32 v0)
                (call (mem:DI (symbol_ref:DI ("__addtf3") [flags 0x41]) [0 S8 A8])
                    (const_int 0 [0])))
            (use (const_int 0 [0]))
            (clobber (reg:DI 30 x30))
        ]) t.c:11 28 {*call_value_symbol}
     (expr_list:REG_DEAD (reg:TF 33 v1)
        (expr_list:REG_EH_REGION (const_int -2147483648 [0x8000])
            (nil)))
    (expr_list (use (reg:TF 33 v1))
        (expr_list (use (reg:TF 32 v0))
            (nil))))

t.c.229r.rtl_dce
----------------
(insn 88 174 133 3 (clobber (reg:TI 32 v0 [orig:108 d+-8 ] [108])) t.c:11 -1
     (nil))
(insn 133 88 89 3 (set (mem/c:TI (plus:DI (reg/f:DI 29 x29)
                (const_int 64 [0x40])) [0 %sfp+-16 S16 A128])
        (reg:TI 32 v0 [orig:108 d+-8 ] [108])) t.c:11 37 {*movti_aarch64}
     (expr_list:REG_DEAD (reg:TI 32 v0 [orig:108 d+-8 ] [108])
        (nil)))
(insn 89 133 90 3 (set (mem/c:DI (plus:DI (reg/f:DI 29 x29)
                (const_int 64 [0x40])) [0 %sfp+-16 S8 A128])
        (reg:DI 19 x19 [orig:102 d ] [102])) t.c:11 34 {*movdi_aarch64}
     (nil))
(insn 90 89 15 3 (set (reg:DI 32 v
reg_equiv_mem and reg_equiv_address are NULL for true_regnum == -1
Hi All,

I am looking at https://gcc.gnu.org/bugzilla/show_bug.cgi?id=62254. Here, in arm_reload_in, for a ref with REG_P (ref) true and true_regnum (ref) == -1, both reg_equiv_mem (REGNO (ref)) and reg_equiv_address (REGNO (ref)) are NULL. Can this happen?

Thanks,
Kugan
Re: A Question About LRA/reload
On 09/12/14 20:37, lin zuojian wrote:
> Hi,
> I have read the ira/lra code for a while, but still fail to understand
> their relationship. The main question is why does ira do the coloring so
> early? The lra pass will do the assignment anyway. Sorry if I mix up
> coloring and hard register assignment, but I think it's better to get the
> job done after lra elimination, inheritance, ...

IRA does the register allocation and LRA matches insn constraints. Therefore IRA has to do the coloring. LRA, in the process of matching constraints, may change some of these assignments. Please look at the following links for more info.

https://ols.fedoraproject.org/GCC/Reprints-2007/makarov-reprint.pdf
https://gcc.gnu.org/wiki/cauldron2012?action=AttachFile&do=get&target=Local_Register_Allocator_Project_Detail.pdf

Thanks,
Kugan
Re: A Question About LRA/reload
On 09/12/14 21:14, lin zuojian wrote:
> Hi Kugan,
> I have read these pdfs. My question is: LRA will change the insns, so
> why bother doing the coloring so early? Changing the insns can
> generate new pseudo registers, so they need to be re-assigned. Is that
> correct?

Hi,

IRA's job here is register allocation and LRA's job is matching the constraints. For example, LRA might have to reload a value into a different register class to match a constraint. To do that, LRA will need a free register from a certain register class. In order to get that free register, LRA might have to change IRA's allocation decisions. LRA needs the register allocation (that is, the coloring info) and spilled pseudo information to see if the constraints can be matched. It iteratively has to change the insns till all the constraints are matched. To get all the details you will have to look at the code.

Thanks,
Kugan
Re: issue with placing includes in gcc-plugin.h
On 14/01/15 21:24, Prathamesh Kulkarni wrote:
> On 14 January 2015 at 14:37, Richard Biener wrote:
>> On Wed, 14 Jan 2015, Prathamesh Kulkarni wrote:
>>
>>> Hi,
>>> I am having an issue with placing includes of expr.h in gcc-plugin.h.
>>> rtl.h is required to be included before expr.h, so I put it in gcc-plugin.h.
>>> However the front-ends then fail to build because rtl.h is not allowed
>>> in front-ends, and the front-ends include gcc-plugin.h (via plugin.h).
>>>
>>> For instance ada/gcc-interface/misc.c failed to build with the following error:
>>> In file included from ../../gcc/gcc/gcc-plugin.h:64:0,
>>>                  from ../../gcc/gcc/plugin.h:23,
>>>                  from ../../gcc/gcc/ada/gcc-interface/misc.c:53:
>>> ../../gcc/gcc/rtl.h:20:9: error: attempt to use poisoned "GCC_RTL_H"
>>>
>>> However rtl.h is required to be included before expr.h, so we cannot skip
>>> including rtl.h in gcc-plugin.h. How do we get around this?
>>> As a temporary hack, could we #undef IN_GCC_FRONTEND in gcc-plugin.h?
>>> java/builtins.c does this to include expr.h.
>>
>> Err - obviously nothing in GCC itself should include gcc-plugin.h,
>> only plugins should. Do we tell plugins that they should include
>> plugin.h?! Why is the include in there?
>>
>> I'd simply remove it
> That doesn't work.
> For instance removing the plugin.h include from c/c-decl.c resulted in
> the following build errors:
> ../../gcc/gcc/c/c-decl.c: In function 'void finish_decl(tree, location_t, tree, tree, tree)':
> ../../gcc/gcc/c/c-decl.c:4990:27: error: 'PLUGIN_FINISH_DECL' was not declared in this scope
> ../../gcc/gcc/c/c-decl.c:4990:51: error: 'invoke_plugin_callbacks' was not declared in this scope
> ../../gcc/gcc/c/c-decl.c: In function 'void finish_function()':
> ../../gcc/gcc/c/c-decl.c:9009:29: error: 'PLUGIN_PRE_GENERICIZE' was not declared in this scope
> ../../gcc/gcc/c/c-decl.c:9009:58: error: 'invoke_plugin_callbacks' was not declared in this scope
> make[3]: *** [c/c-decl.o] Error 1
> make[2]: *** [all-stage1-gcc] Error 2
> make[1]: *** [stage1-bubble] Error 2
> make: *** [all] Error 2
>
> Why do the front-ends need to include plugin.h?

The C/C++ front-ends have callbacks to process declarations. Please look at https://gcc.gnu.org/ml/gcc-patches/2010-04/msg00780.html, which added the callback PLUGIN_FINISH_DECL.

Thanks,
Kugan
loop_latch_edge is NULL during jump threading
In the linaro-4.9 branch, with the following (reduced) test case, I run into a situation where loop_latch_edge is NULL during jump threading. I am wondering if this is possible during jump threading, or whether the error lies somewhere else? I can't reproduce it with the trunk.

int a;
fn1 ()
{
  enum
  {
    UQSTRING,
    SQSTRING,
    QSTRING
  } b = UQSTRING;
  while (1)
    switch (a)
      {
      case '\'':
	b = QSTRING;
      default:
	switch (b)
	case UQSTRING:
	  return;
	b = SQSTRING;
      }
}

x.c:2:1: internal compiler error: Segmentation fault
 fn1() {
 ^
0x83694f crash_signal
	/home/kugan.vivekanandarajah/work/sources/gcc-fsf/linaro/gcc/toplev.c:337
0x96d8a8 thread_block_1
	/home/kugan.vivekanandarajah/work/sources/gcc-fsf/linaro/gcc/tree-ssa-threadupdate.c:797
0x96da3e thread_block
	/home/kugan.vivekanandarajah/work/sources/gcc-fsf/linaro/gcc/tree-ssa-threadupdate.c:941
0x96e59c thread_through_all_blocks(bool)
	/home/kugan.vivekanandarajah/work/sources/gcc-fsf/linaro/gcc/tree-ssa-threadupdate.c:1866
0x9d77e9 finalize_jump_threads
	/home/kugan.vivekanandarajah/work/sources/gcc-fsf/linaro/gcc/tree-vrp.c:9709
0x9d77e9 execute_vrp
	/home/kugan.vivekanandarajah/work/sources/gcc-fsf/linaro/gcc/tree-vrp.c:9864
0x9d77e9 execute
	/home/kugan.vivekanandarajah/work/sources/gcc-fsf/linaro/gcc/tree-vrp.c:9938
Please submit a full bug report, with preprocessed source if appropriate.
Please include the complete backtrace with any bug report.

If I apply the following patch, the segfault goes away. Is this the right approach?

diff --git a/gcc/tree-ssa-threadupdate.c b/gcc/tree-ssa-threadupdate.c
index d1b289f..0bcef35 100644
--- a/gcc/tree-ssa-threadupdate.c
+++ b/gcc/tree-ssa-threadupdate.c
@@ -794,6 +794,8 @@ thread_block_1 (basic_block bb, bool noloop_only, bool joiners)
       if (loop->header == bb)
	 {
	   e = loop_latch_edge (loop);
+	  if (!e)
+	    return false;
	   vec<jump_thread_edge *> *path = THREAD_PATH (e);
 
	   if (path
@@ -1114,6 +1116,8 @@ thread_through_loop_header (struct loop *loop, bool may_peel_loop_headers)
   basic_block tgt_bb, atgt_bb;
   enum bb_dom_status domst;
 
+  if (!latch)
+    return false;
   /* We have already threaded through headers to exits, so all the threading
      requests now are to the inside of the loop.  We need to avoid creating
      irreducible regions (i.e., loops with more than one entry block), and

Thanks,
Kugan
Re: loop_latch_edge is NULL during jump threading
On 02/03/15 15:29, Jeff Law wrote:
> On 03/01/15 16:32, Kugan wrote:
>> In the linaro-4.9 branch, with the following (reduced) test case, I run into
>> a situation where loop_latch_edge is NULL during jump threading. I am
>> wondering if this is possible during jump threading or the error lies
>> somewhere else? I can't reproduce it with the trunk.
> There's really no way to tell without a lot more information. If you
> can't reproduce on the 4.9 branch or the trunk, then you're likely going
> to have to do the real digging.
>
> The first thing I tend to do with these things is to draw the CFG and
> annotate it with all the jump threading paths. Then I look at how the
> jump threading paths interact with each other and the loop structure,
> then reconcile that with the constraints placed on threading in
> tree-ssa-threadupdate.c.

Thanks Jeff. I will do the same.

Kugan
Re: Combine changes ASHIFT into mult for non-MEM rtx
On 02/04/15 20:39, Bin.Cheng wrote:
> Hi,
> In function make_compound_operation, the code/comment says:
>
>     case ASHIFT:
>       /* Convert shifts by constants into multiplications if inside
>	   an address.  */
>       if (in_code == MEM && CONST_INT_P (XEXP (x, 1))
>	   && INTVAL (XEXP (x, 1)) < HOST_BITS_PER_WIDE_INT
>	   && INTVAL (XEXP (x, 1)) >= 0
>	   && SCALAR_INT_MODE_P (mode))
>	 {
>
> Right now, it changes ASHIFT in any SET into mult because of the code below:
>
>   /* Select the code to be used in recursive calls.  Once we are inside an
>      address, we stay there.  If we have a comparison, set to COMPARE,
>      but once inside, go back to our default of SET.  */
>
>   next_code = (code == MEM ? MEM
>	       : ((code == PLUS || code == MINUS)
>		  && SCALAR_INT_MODE_P (mode)) ? MEM  // <bogus?
>	       : ((code == COMPARE || COMPARISON_P (x))
>		  && XEXP (x, 1) == const0_rtx) ? COMPARE
>	       : in_code == COMPARE ? SET : in_code);
>
> This seems an oversight to me. The effect is that all targets have to
> support the generated expression in the corresponding pattern. Sometimes
> the generated expression is just too convoluted and gets missed. For
> example, the below insn is tried by combine:
>
> (set (reg:SI 79 [ D.2709 ])
>     (plus:SI (subreg:SI (sign_extract:DI (mult:DI (reg:DI 1 x1 [ i ])
>                     (const_int 2 [0x2]))
>                 (const_int 17 [0x11])
>                 (const_int 0 [0])) 0)
>         (reg:SI 0 x0 [ a ])))
>
> It actually equals
>
> (set (reg/i:SI 0 x0)
>     (plus:SI (ashift:SI (sign_extend:SI (reg:HI 1 x1 [ i ]))
>             (const_int 1 [0x1]))
>         (reg:SI 0 x0 [ a ])))
>
> which equals the below instruction on AArch64:
>
>     add	w0, w0, w1, sxth 1
>
> Because of the existing comment, and also because it will make backends
> easier (I suppose), is it reasonable to fix this behavior in
> combine.c? Another question is, if we are going to make the change,
> how many targets might be affected?

I think https://gcc.gnu.org/ml/gcc-patches/2015-01/msg01020.html is related to this.

Thanks,
Kugan
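[For reference, the canonicalization the comment alludes to: inside a MEM, canonical RTL writes a scaled index with mult rather than ashift. An illustrative, made-up address:]

(mem:SI (plus:DI (mult:DI (reg:DI 1)
                          (const_int 4))
                 (reg:DI 0)))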
Re: optimization question
On 19/05/15 12:58, mark maule wrote:
> Thank you for taking a look Martin. I will attempt to pare this down,
> provide a sample with typedefs/macros expanded, etc. and repost to
> gcc-help. To address a couple of your points:

If you haven't already, you can have a look at https://gcc.gnu.org/wiki/A_guide_to_testcase_reduction. There are some examples/techniques for creating a reduced test-case that reproduces the problem.

Thanks,
Kugan
Re: LTO crashes with fortran code in SPEC CPU 2006
On 15/01/17 15:57, Andrew Pinski wrote:
> This is just an FYI until I reduce the testcases, but 5 benchmarks in
> SPEC CPU 2006 with fortran code are causing an ICE on aarch64-linux-gnu
> with -Ofast -flto -mcpu=thunderx2t99 -fno-aggressive-loop-optimizations
> -funroll-loops:
>
> lto1: internal compiler error: in ipa_get_type, at ipa-prop.h:448
> 0x107c58f ipa_get_type
> 	../../gcc/gcc/ipa-prop.h:448
> 0x107c58f propagate_constants_across_call
> 	../../gcc/gcc/ipa-cp.c:2259
> 0x1080f4f propagate_constants_topo
> 	../../gcc/gcc/ipa-cp.c:3170
> 0x1080f4f ipcp_propagate_stage
> 	../../gcc/gcc/ipa-cp.c:3267
> 0x1081fcb ipcp_driver
> 	../../gcc/gcc/ipa-cp.c:4997
> Please submit a full bug report, with preprocessed source if appropriate.
> Please include the complete backtrace with any bug report.
> See <http://gcc.gnu.org/bugs.html> for instructions.
> lto-wrapper: fatal error: gfortran returned 1 exit status
>
> I don't know when this started as I am just starting to run the SPEC CPU
> 2006 fp side with my spec cpu 2006 config.

I am seeing this too for aarch64 with -O3 -flto. It did work a few weeks back. This must be a new bug.

Thanks,
Kugan

> Thanks,
> Andrew
[PR43721] Failure to optimise (a/b) and (a%b) into single call
Hi,

I am attempting to fix Bug 43721 - Failure to optimise (a/b) and (a%b) into a single __aeabi_idivmod call (http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43721).

The execute_cse_sincos tree level pass does similar cse, so I attempted to use a similar approach here. Div/mod cse is not really using built-in functions though at this level. For the case of div and mod operations, after CSE is performed, there isn't a way to represent the resulting statement in gimple: we would end up with a divmod taking two arguments and returning double the size of one argument in the three address format (divmod returns the remainder and the quotient, so the return type is double the size of the argument type). Since GIMPLE_ASSIGN would result in a type checking failure in this case, I attempted to use built-in functions (a GIMPLE_CALL) to represent the runtime library call. The name of the function here is target specific and can be obtained from the sdivmod optab, so the builtin function name defined at tree level is not used.

I am not entirely sure this is the right approach, so I am attaching the first cut of the patch to get your feedback and understand the right approach to this problem.

Thank you,
Kugan

diff --git a/gcc/builtin-types.def b/gcc/builtin-types.def
index 2634ecc..21c483a 100644
--- a/gcc/builtin-types.def
+++ b/gcc/builtin-types.def
@@ -250,6 +250,10 @@ DEF_FUNCTION_TYPE_2 (BT_FN_INT_CONST_STRING_FILEPTR,
 		     BT_INT, BT_CONST_STRING, BT_FILEPTR)
 DEF_FUNCTION_TYPE_2 (BT_FN_INT_INT_FILEPTR,
 		     BT_INT, BT_INT, BT_FILEPTR)
+DEF_FUNCTION_TYPE_2 (BT_FN_LONGLONG_INT_INT,
+		     BT_LONGLONG, BT_INT, BT_INT)
+DEF_FUNCTION_TYPE_2 (BT_FN_ULONGLONG_UINT_UINT,
+		     BT_ULONGLONG, BT_UINT, BT_UINT)
 DEF_FUNCTION_TYPE_2 (BT_FN_VOID_PTRMODE_PTR,
 		     BT_VOID, BT_PTRMODE, BT_PTR)
 DEF_FUNCTION_TYPE_2 (BT_FN_VOID_PTR_PTRMODE,
diff --git a/gcc/builtins.c b/gcc/builtins.c
index 402bb1f..1cae2bb 100644
--- a/gcc/builtins.c
+++ b/gcc/builtins.c
@@ -1876,7 +1876,9 @@ mathfn_built_in_1 (tree type, enum built_in_function fn, bool implicit_p)
     CASE_MATHFN (BUILT_IN_Y0)
     CASE_MATHFN (BUILT_IN_Y1)
     CASE_MATHFN (BUILT_IN_YN)
-
+    case BUILT_IN_DIVMOD:
+    case BUILT_IN_UDIVMOD:
+      return builtin_decl_explicit (fn);
     default:
       return NULL_TREE;
     }
@@ -2449,6 +2451,57 @@ expand_builtin_interclass_mathfn (tree exp, rtx target)
   return NULL_RTX;
 }
 
+/* Expand a call to the builtin divmod function to
+   a library call.  */
+static rtx
+expand_builtin_divmod (tree exp, rtx target)
+{
+  rtx op0, op1;
+  enum machine_mode mode;
+  tree arg0, arg1;
+  rtx libval;
+  rtx libfunc;
+  rtx insns;
+  bool is_unsigned;
+
+  arg0 = CALL_EXPR_ARG (exp, 0);
+  arg1 = CALL_EXPR_ARG (exp, 1);
+
+  mode = TYPE_MODE (TREE_TYPE (arg0));
+  is_unsigned = TYPE_UNSIGNED (TREE_TYPE (arg0));
+
+  /* Get the libcall.  */
+  libfunc = optab_libfunc (is_unsigned ? udivmod_optab : sdivmod_optab, mode);
+  gcc_assert (libfunc);
+
+  op0 = expand_normal (arg0);
+  op1 = expand_normal (arg1);
+
+  if (MEM_P (op0))
+    op0 = force_reg (mode, op0);
+  if (MEM_P (op1))
+    op1 = force_reg (mode, op1);
+
+  /* The value returned by the library function will have twice as
+     many bits as the nominal MODE.  */
+  machine_mode libval_mode
+    = smallest_mode_for_size (2 * GET_MODE_BITSIZE (mode), MODE_INT);
+  start_sequence ();
+  libval = emit_library_call_value (libfunc, NULL_RTX, LCT_CONST,
+				    libval_mode, 2,
+				    op0, mode,
+				    op1, mode);
+  insns = get_insns ();
+  end_sequence ();
+
+  /* Move into the desired location.  */
+  if (target != const0_rtx)
+    emit_libcall_block (insns, target, libval,
+			gen_rtx_fmt_ee (is_unsigned ? UMOD : MOD,
+					mode, op0, op1));
+
+  return target;
+}
+
 /* Expand a call to the builtin sincos math function.
    Return NULL_RTX if a normal call should be emitted rather than expanding the
    function in-line.  EXP is the expression that is a call to the builtin
@@ -5977,6 +6030,13 @@ expand_builtin (tree exp, rtx target, rtx subtarget, enum machine_mode mode,
	 return target;
       break;
 
+    case BUILT_IN_DIVMOD:
+    case BUILT_IN_UDIVMOD:
+      target = expand_builtin_divmod (exp, target);
+      if (target)
+	return target;
+      break;
+
     CASE_FLT_FN (BUILT_IN_SINCOS):
       if (! flag_unsafe_math_optimizations)
	 break;
diff --git a/gcc/builtins.def b/gcc/builtins.def
index 91879a6..7664700 100644
--- a/gcc/builtins.def
+++ b/gcc/builtins.def
@@ -599,6 +599,8 @@ DEF_C99_BUILTIN (BUILT_IN_VSCANF, "vscanf", BT_FN_INT_CONST_STRING_VALIST
 DEF_C99_BUILTIN (BUILT_IN_VSNPRINTF, "vsnprintf", BT_FN_INT_STRING_SIZE_CONST_STRING_VALIST_ARG, ATTR_FORMAT_PRINTF_NOTHROW_3_0)
 DEF_LIB_BUILTIN (BUILT_IN_VSPRINTF, "vsprintf", BT_FN_INT_STRING_CONST_STRING_VALIST_ARG,
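[For reference, the source-level shape of the transformation being discussed (an illustrative example; the register convention is from the ARM RTABI, where __aeabi_idivmod returns the quotient in r0 and the remainder in r1):]

int q, r;

void
f (int a, int b)
{
  q = a / b;  /* Today: a call to __aeabi_idiv.  */
  r = a % b;  /* Today: a separate call to __aeabi_idivmod.  */
  /* After the proposed CSE, both results come from one
     __aeabi_idivmod call.  */
}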
Re: [PR43721] Failure to optimise (a/b) and (a%b) into single call
On 17/06/13 19:07, Richard Biener wrote:
> On Mon, 17 Jun 2013, Kugan wrote:
>> Hi,
>> I am attempting to fix Bug 43721 - Failure to optimise (a/b) and (a%b)
>> into a single __aeabi_idivmod call
>> (http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43721)
>>
>> The execute_cse_sincos tree level pass does similar cse, so I attempted
>> to use a similar approach here. Div/mod cse is not really using built-in
>> functions though at this level.
>
> The issue with performing the transform at the same time as we transform
> SINCOS is that the vectorizer will now no longer be able to vectorize
> these loops. It would need to be taught how to handle the builtin calls
> (basically undo the transformation, I don't know of any ISA that can do
> vectorized combined div/mod). Which means it should rather be done at
> the point we CSE reciprocals (which also replaces computes with builtin
> target function calls).

Thanks Richard. Since execute_cse_reciprocals is handling reciprocals only, I added another pass to do divmod. Is that OK?

>> For the case of div and mod operations, after CSE is performed, there
>> isn't a way to represent the resulting statement in gimple. We will end
>> up with a divmod taking two arguments and returning double the size of
>> one argument in the three address format (divmod returns the remainder
>> and the quotient, so the return type is double the size of the argument
>> type). Since GIMPLE_ASSIGN will result in a type checking failure in
>> this case, I attempted to use built-in functions (GIMPLE_CALL) to
>> represent the runtime library call. The name of the function here is
>> target specific and can be obtained from the sdivmod optab, so the
>> builtin function name defined at tree level is not used. I am not
>> entirely sure this is the right approach, so I am attaching the first
>> cut of the patch to get your feedback and understand the right approach
>> to this problem.
>
> If we don't want to expose new builtins to the user (I'm not sure we
> want that), then using "internal functions" is an easier way to avoid
> these issues (see gimple.h and internal-fn.(def|h)).

I have now changed to use internal functions. Thanks for that.

> Generally the transform looks useful to me as it moves forward with the
> general idea of moving pattern recognition done during RTL expansion to
> an earlier place. For the use of a larger integer type and shifts to
> represent the modulo and division result I don't think that's the very
> best idea. Instead resorting to a complex integer type as return value
> looks more appealing (similar to sincos using cexpi here). That way you
> also avoid the ugly hard-coding of bit-sizes.

I have changed it to use complex integers now.

> +  if (HAVE_divsi3
> +      || (GET_MODE_BITSIZE (TYPE_MODE (type)) != 32)
>
> watch out for types whose TYPE_PRECISION is not the bitsize of their
> mode. Also it should be GET_MODE_PRECISION here.
>
> +      || !optab_libfunc (TYPE_UNSIGNED (type) ? udivmod_optab : sdivmod_optab,
> +			  TYPE_MODE (type)))
>
> targets that use a libfunc should also get this optimization, as it
> always removes computations. I think the proper test is for whether the
> target can do division and/or modulus without using a libfunc, not
> whether there is a divmod optab/libfunc.

I guess the best way to do this is by defining a target hook and letting the target define the required behaviour. Is that what you had in mind? I have attached a modified patch with these changes.

> Others knowing this piece of the compiler better may want to comment
> here, of course.
>
> Thanks,
> Richard.
Thanks,
Kugan

diff --git a/gcc/doc/tm.texi b/gcc/doc/tm.texi
index f030b56..3fae80e 100644
--- a/gcc/doc/tm.texi
+++ b/gcc/doc/tm.texi
@@ -11375,3 +11375,8 @@
 It returns true if the target supports GNU indirect functions.
 The support includes the assembler, linker and dynamic linker.
 The default value of this hook is based on target's libc.
 @end deftypefn
+
+@deftypefn {Target Hook} bool TARGET_COMBINE_DIVMOD (enum machine_mode @var{mode})
+This target hook returns @code{true} if the target provides divmod libcall
+operation for the machine mode @var{mode} and must be used to combine integer
+division and modulus operations.  Return @code{false} otherwise.
+@end deftypefn
+
diff --git a/gcc/doc/tm.texi.in b/gcc/doc/tm.texi.in
index cc25fec..12974b1 100644
--- a/gcc/doc/tm.texi.in
+++ b/gcc/doc/tm.texi.in
@@ -11198,3 +11198,6 @@
 memory model bits are allowed.
 
 @hook TARGET_ATOMIC_TEST_AND_SET_TRUEVAL
 
 @hook TARGET_HAS_IFUNC_P
+
+@hook TARGET_COMBINE_DIVMOD
+
diff --git a/gcc/internal-fn.c b/gcc/internal-fn.c
index b841abd..0db06f1 100644
--- a/gcc/internal-fn.c
+++ b/gcc/internal-fn.c
@@ -61,6 +61,62 @@ get_multi_vector_move (tree array_type, convert_optab optab)
   return icode;
 }
 
+/* Expand DIVMOD call STMT.  */
+static void
+expand_DIVMOD (gimple stmt)
+{
+  tree type, lhs, arg0, arg1;
+  rtx op0, op1, res0, res1, target;
+  enum machine_mode mode, compute_mode;
+  rtx libval;
+  rtx libfunc = NULL_RTX;
+  bool is_unsigned;
+
+  lhs = gimple_ca
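[With the complex-integer return value Richard suggested, the CSEd GIMPLE would look roughly like the following sketch (not an actual dump; this is also the shape the DIVMOD internal function eventually took upstream, with the quotient in the real part and the remainder in the imaginary part):]

  _1 = DIVMOD (a_2(D), b_3(D));
  q_4 = REALPART_EXPR <_1>;
  r_5 = IMAGPART_EXPR <_1>;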
Re: On-Demand range technology [1/5] - Executive Summary
Hi Andrew,

Thanks for working on this.

The "Enable elimination of zext/sext with VRP" patch had to be reverted (https://gcc.gnu.org/ml/gcc-patches/2014-09/msg00672.html) due to the need for value ranges in PROMOTED_MODE precision for at least one test case on alpha.

Playing with the ranger suggests that it is not possible to get value ranges in PROMOTED_MODE precision on demand. Or is there any way we can use the on-demand ranger here?

Thanks,
Kugan

On Thu, 23 May 2019 at 11:28, Andrew MacLeod wrote:
>
> Now that stage 1 has reopened, I’d like to reopen a discussion about the
> technology and experiences we have from the Ranger project I brought up
> last year. https://gcc.gnu.org/ml/gcc/2018-05/msg00288.html . (The
> original wiki pages are now out of date, and I will work on updating
> them soon.)
>
> The Ranger is designed to evaluate ranges on-demand rather than through
> a top-down approach. This means you can ask for a range from anywhere,
> and it walks back thru the IL satisfying any preconditions and doing the
> required calculations. It utilizes a cache to avoid re-doing work. If
> ranges are processed in a forward dominator order, it’s not much
> different than what we do today. Due to its nature, the order you
> process things in has minimal impact on the overall time… You can do it
> in reverse dominator order and get similar times.
>
> It requires no outside preconditions (such as dominators) to work, and
> has a very simple API… Simply query the range of an ssa_name at any
> point in the IL and all the details are taken care of.
>
> We have spent much of the past 6 months refining the prototype (branch
> “ssa-range”) and adjusting it to share as much code with VRP as
> possible. They are currently using a common code base for extracting
> ranges from statements, as well as simplifying statements.
>
> The Ranger deals with just ranges. The other aspects of VRP are
> intended to be follow-on work that integrates tightly with it, but are
> also independent and would be available for other passes to use. These
> include:
> - Equivalency tracking
> - Relational processing
> - Bitmask tracking
>
> We have implemented a VRP pass that duplicates the functionality of EVRP
> (other than the bits mentioned above), as well as converted a few other
> passes to use the interface. I do not anticipate those missing bits
> having a significant impact on the results.
>
> The prototype branch is quite stable and can successfully build and test
> an entire Fedora distribution (9174 packages). There is an issue with
> switches I will discuss later whereby the constant range of a switch
> edge is not readily available and is exponentially expensive to
> calculate. We have a design to address that problem, and in the common
> case we are about 20% faster than EVRP is.
>
> When utilized in passes which only require ranges for a small number of
> ssa-names we see significant improvements. The sprintf warning pass for
> instance allows us to remove the calculations of dominators and the
> resulting forced walk order. We see a 95% speedup (yes, 1/20th of the
> overall time!). This is primarily due to no additional overhead and
> only calculating the few things that are actually needed. The walloca
> and wrestrict passes are a similar model, but as they have not been
> converted to use EVRP ranges yet, we don’t see similar speedups there.
>
> That is the executive summary. I will go into more details of each
> major thing mentioned in follow-on notes so that comments and
> discussions can focus on one thing at a time.
>
> We think this approach is very solid and has many significant benefits
> to GCC. We’d like to address any concerns you may have, and work towards
> finding a way to integrate this model with the code base during this
> stage 1.
>
> Comments and feedback always welcome!
> Thanks
> Andrew
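To make the "very simple API" claim concrete, a use would look something like the sketch below. The class and method names follow what the ranger work eventually used on trunk (gimple_ranger::range_of_expr); the prototype branch may spell them differently, so treat this as illustrative rather than as the branch's actual interface:

    #include "gimple-range.h"

    /* Query the range of SSA at STMT; no dominators or pass ordering
       required, the ranger walks back through the IL on demand.  */
    static bool
    query_range (tree ssa, gimple *stmt)
    {
      gimple_ranger ranger;
      int_range_max r;
      if (ranger.range_of_expr (r, ssa, stmt) && !r.varying_p ())
        {
          /* R now holds the best range the ranger can prove here.  */
          return true;
        }
      return false;
    }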
Re: On-Demand range technology [1/5] - Executive Summary
Hi Andrew,

Thanks for looking into it, and my apologies for not being clear. My proposal was to use value ranges when expanding gimple to RTL and eliminate redundant zero/sign extensions. I.e., if we know the value generated by some gimple operation is already in the (zero/sign) extended form based on our VR analysis, we could skip the SUBREG and ZERO/SIGN_EXTEND (or set SRP_SIGNED_AND_UNSIGNED and the like).

However, the problem is that RTL operations are done in PROMOTE_MODE precision while gimple value ranges are in the natural types. This can be a problem when values wrap (and it shows up mainly on targets where PROMOTE_MODE is DImode, like alpha). For example, as Uros pointed out with the reverted patch, for alpha-linux we had

FAIL: libgomp.fortran/simd7.f90 -O2 execution test
FAIL: libgomp.fortran/simd7.f90 -Os execution test

The reason is that values wrap, and in the VR calculation we only record the type precision (which is what matters for gimple), but in order to eliminate the zero/sign extension we need the full precision in the PROMOTE_MODE. An extract from the failing testcase:

_343 = ivtmp.179_52 + 2147483645; [0x8004, 0x80043]
_344 = _343 * 2; [0x8, 0x86]
_345 = (integer(kind=4)) _344; [0x8, 0x86]

With the above VR of [0x8, 0x86] (which in promoted precision is [0x10008, 0x10086]), my patch was setting SRP_SIGNED_AND_UNSIGNED, which was wrong and caused the error (eliminating an extension which was not redundant). If we had the VR in promoted precision, the patch would be correct and could be used to eliminate redundant zero/sign extensions.

Please let me know if my explanation is not clear and I will show it with more examples.

Thanks,
Kugan

On Fri, 21 Jun 2019 at 23:27, Andrew MacLeod wrote:
>
> On 6/19/19 11:04 PM, Kugan Vivekanandarajah wrote:
>
> Hi Andrew,
>
> Thanks for working on this.
>
> Enable elimination of zext/sext with VRP patch had to be reverted in
> (https://gcc.gnu.org/ml/gcc-patches/2014-09/msg00672.html) due to the
> need for value ranges in PROMOTED_MODE precision for at least 1 test
> case for alpha.
>
> Playing with ranger suggests that it is not possible to get value
> ranges in PROMOTED_MODE precision on demand. Or is there any way we
> can use on-demand ranger here?
>
> Thanks,
> Kugan
>
>
> I took a look at the thread, but I think I'm still missing some context.
>
> Lets go back to the beginning. What is an example of the case we arent
> getting that you want to get?
>
> I'm going to guess to start :-)
>
> short foo(unsigned char c)
> {
>    c = c & (unsigned char)0x0F;
>    if( c > 7 )
>      return((short)(c - 5));
>    else
>      return(( short )c);
> }
>
> A run of this thru the ranger shows me code that looks like (on x86 anyway):
>
> === BB 2
> c_4(D) [0, 255] unsigned char
> :
> c_5 = c_4(D) & 15;
> _9 = c_4(D) & 8;
> if (_9 != 0)
>   goto ; [INV]
> else
>   goto ; [INV]
>
> c_5 : [0, 15] unsigned char
> _9 : [0, 0][8, 8] unsigned char
> 2->3 (T)*c_5 : [0, 15] unsigned char
> 2->3 (T) _9 : [8, 8] unsigned char
> 2->4 (F)*c_5 : [0, 15] unsigned char
> 2->4 (F) _9 : [0, 0] unsigned char
>
> === BB 3
> c_5 [0, 15] unsigned char
> :
> _1 = (unsigned short) c_5;
> _2 = _1 + 65531;
> _7 = (short int) _2;
> // predicted unlikely by early return (on trees) predictor.
> goto ; [INV]
>
> _1 : [0, 15] unsigned short
> _2 : [0, 10][65531, 65535] unsigned short
> _7 : [-5, 10] short int
>
> === BB 4
> c_5 [0, 15] unsigned char
> :
> _6 = (short int) c_5;
> // predicted unlikely by early return (on trees) predictor.
>
> I think I see. we aren't adjusting the range of c_5 going into blocks 3
> and 4. It's obvious from the original source where the code says < 7, but
> once it's been "bitmasked" that info becomes obfuscated.
>
> If you were to see a range in bb3 of c_5 = [8,15], and a range in bb4 of
> c_5 = [0,7], would that solve your problem?
>
> so in bb3, we'd then see ranges that look like:
>
> _1 : [8, 15] unsigned short
> _2 : [3, 10] unsigned short
> _7 : [3, 10] short int
>
> and then later on we'd see there is no negative/wrap value, and you
> could just drop the extension then?
>
> SO.
>
> yes. this is fixable, but is it easy? :-)
>
> We're in the process of replacing the range extraction code with the
> range-ops/gori-computes components from the ranger. This is the part
> which figures ranges out from individual statements on exit to a block.
>
> We have implemented mostly the same func
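To make the precision problem concrete, here is a small worked illustration with invented numbers (not taken from the simd7.f90 testcase), assuming PROMOTE_MODE widens unsigned short to a 64-bit register as on alpha:

    unsigned short
    f (unsigned short x)          /* assume a known VR of x: [0xFFF8, 0xFFFF] */
    {
      /* In the 16-bit type the addition wraps, so the VR of y is
         [0x0008, 0x000F].  But the RTL add happens in the promoted
         64-bit register, which then really contains a value in
         [0x10008, 0x1000F] until a truncation is emitted.  A VR
         recorded only in 16-bit precision therefore cannot prove the
         zero-extension redundant.  */
      unsigned short y = x + 16;
      return y;
    }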
Re: Duplicating loops and virtual phis
Hi Bin and Steve,

On 17 May 2017 at 19:41, Bin.Cheng wrote:
> On Mon, May 15, 2017 at 7:32 PM, Richard Biener
> wrote:
>> On May 15, 2017 6:56:53 PM GMT+02:00, Steve Ellcey
>> wrote:
>>> On Sat, 2017-05-13 at 08:18 +0200, Richard Biener wrote:
>>>> On May 12, 2017 10:42:34 PM GMT+02:00, Steve Ellcey
>>>> om> wrote:
>>>>>
>>>>> (Short version of this email, is there a way to recalculate/rebuild
>>>>> virtual phi nodes after modifying the CFG.)
>>>>>
>>>>> I have a question about duplicating loops and virtual phi nodes.
>>>>> I am trying to implement the following optimization as a pass:
>>>>>
>>>>> Transform:
>>>>>
>>>>> for (i = 0; i < n; i++) {
>>>>>     A[i] = A[i] + B[i];
>>>>>     C[i] = C[i-1] + D[i];
>>>>> }
>>>>>
>>>>> Into:
>>>>>
>>>>> if (noalias between A&B, A&C, A&D)
>>>>>     for (i = 0; i < 100; i++)
>>>>>         A[i] = A[i] + B[i];
>>>>>     for (i = 0; i < 100; i++)
>>>>>         C[i] = C[i-1] + D[i];
>>>>> else
>>>>>     for (i = 0; i < 100; i++) {
>>>>>         A[i] = A[i] + B[i];
>>>>>         C[i] = C[i-1] + D[i];
>>>>>     }
>>>>>
>>>>> Right now the vectorizer sees that 'C[i] = C[i-1] + D[i];' cannot be
>>>>> vectorized so it gives up and does not vectorize the loop. If we split
>>>>> up the loop into two loops then the vector add with A[i] could be
>>>>> vectorized even if the one with C[i] could not.
>>>> Loop distribution does this transform but it doesn't know about
>>>> versioning for unknown dependences.
>>>>
>>>
>>> Yes, I looked at loop distribution. But it only works with global
>>> arrays and not with pointer arguments where it doesn't know the size of
>>> the array being pointed at. I would like to be able to have it work
>>> with pointer arguments. If I call a function with 2 or
>>> more integer pointers, and I have a loop that accesses them with
>>> offsets between 0 and N where N is loop invariant then I should have
>>> enough information (at runtime) to determine if there are overlapping
>>> memory accesses through the pointers and determine whether or not I can
>>> distribute the loop.
>>
>> Not sure where you got that from. Loop distribution works with our data
>> reference / dependence analysis. The cost model might be more restricted
>> but that can be fixed.
>>
>>> The loop splitting code seemed like a better template since it already
>>> knows how to split a loop based on a runtime determined condition. That
>>> part seems to be working for me, it is when I try to
>>> distribute/duplicate one of those loops (under the unaliased condition)
>>> that I am running into the problem with virtual PHIs.
>>
>> There's mark_virtual*for_renaming (sp?).
>>
>> But as said you are performing loop distribution so please enhance the
>> existing pass rather than writing a new one.
> I happen to be working on loop distribution now (if I guess correctly,
> to get hmmer fixed). So far my idea is to fuse the finest distributed
> loop in two passes, in the first pass, we merge all SCCs due to "true"
> data dependence; in the second one we identify all SCCs and break
> them on dependent edges due to possible alias. Breaking SCCs with a
> minimal edge set can be modeled as the Feedback arc set problem which is
> NP-hard. Fortunately the problem is small in our case and there are
> approximation algorithms. OTOH, we should also improve loop
> distribution/fusion to maximize parallelism / minimize
> synchronization, as well as maximize data locality, but I think this
> is not needed to get hmmer vectorized.

I am also looking into vectorizing the hmmer loop. Glad to know you are also looking at this problem, and I am looking forward to seeing the patches. I have some experimental patches where I added the data references that need runtime checking to a list:

 static int
 pg_add_dependence_edges (struct graph *rdg, vec loops, int dir,
                          vec drs1,
-                         vec drs2)
+                         vec drs2,
+                         vec &ddrs,
+                         bool runtime_alias_check)

Then I am versioning the main loop based on the condition generated from the runtime check. I have borrowed the logic from the vectorizer (like pruning the list and generating the condition). I have neither verified nor benchmarked it enough yet.

As I understand it, we should also have some form of cost model that can see the data access patterns and decide whether the distributed loops can be vectorized. The cost model in similar_memory_accesses also needs to be relaxed based on the ability to vectorize distributed loops.

Thanks,
Kugan

> Thanks,
> bin
>>
>> Richard.
>>
>>> Steve Ellcey
>>> sell...@cavium.com
>>
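For one pair of data references, the versioning condition boils down to an address-overlap test of the shape below. This is illustrative only; the real condition is built from the pruned DDR list, as in the vectorizer's runtime alias checks:

    #include <stdint.h>

    void
    f (double *a, double *b, double *c, double *d, int n)
    {
      /* Segments [a, a+n) and [c, c+n) do not overlap; addresses are
         compared as integers to sidestep C-level pointer-comparison
         pedantry.  */
      int noalias = ((uintptr_t) (a + n) <= (uintptr_t) c
                     || (uintptr_t) (c + n) <= (uintptr_t) a);
      if (noalias)
        {
          for (int i = 1; i < n; i++)   /* distributed: vectorizable */
            a[i] = a[i] + b[i];
          for (int i = 1; i < n; i++)   /* distributed: recurrence stays scalar */
            c[i] = c[i - 1] + d[i];
        }
      else
        for (int i = 1; i < n; i++)     /* original fused loop */
          {
            a[i] = a[i] + b[i];
            c[i] = c[i - 1] + d[i];
          }
    }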
Loop reversal
I am looking into reversing loops to increase efficiency. There is already PR22041 for this, and an old patch by Zdenek (https://gcc.gnu.org/ml/gcc-patches/2006-01/msg01851.html) which never made it to mainline.

For a constant loop count, the ivcanon pass adds a reverse IV, but it is not selected by ivopts. For example:

void copy (unsigned int N, double *a, double *c)
{
  for (int i = 0; i < 800; ++i)
    c[i] = a[i];
}

The ivcanon pass dump shows:

  Added canonical iv to loop 1, 799 iterations.
  ivtmp_14 = ivtmp_15 - 1;

In ivopts, candidate 10 is selected:

Candidate 10:
  Var befor: ivtmp.11
  Var after: ivtmp.11
  Incr POS: before exit test
  IV struct:
    Type: sizetype
    Base: 0
    Step: 8
    Biv: N

If we look at the groups:

Group 0:
  Type: ADDRESS
  Use 0.0:
    At stmt: _5 = *_3;
    At pos: *_3
    IV struct:
      Type: double *
      Base: a_9(D)
      Step: 8
      Object: (void *) a_9(D)
      Biv: N
      Overflowness wrto loop niter: Overflow
Group 1:
  Type: ADDRESS
  Use 1.0:
    At stmt: *_4 = _5;
    At pos: *_4
    IV struct:
      Type: double *
      Base: c_10(D)
      Step: 8
      Object: (void *) c_10(D)
      Biv: N
      Overflowness wrto loop niter: Overflow
Group 2:
  Type: COMPARE
  Use 2.0:
    At stmt: if (ivtmp_14 != 0)
    At pos: ivtmp_14
    IV struct:
      Type: unsigned int
      Base: 799
      Step: 4294967295
      Biv: Y
      Overflowness wrto loop niter: Overflow

The ivopts cost model assumes that groups 0 and 1 will have infinite cost for the IV added by the ivcanon pass, because of that IV's lower precision. If I change the example to:

void copy (unsigned int N, double *a, double *c)
{
  for (long i = 0; i < 800; ++i)
    c[i] = a[i];
}

it still has a higher cost for groups 0 and 1, due to the negative step. I think this can be improved. My questions are:

1. For the case where the loop count is not constant, can we make ivcanon add a reverse IV with the current implementation? Can ivopts be taught to select the reverse IV?

2. Or is the patch by Zdenek a better option? I am re-basing it for trunk.

Thanks,
Kugan
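For reference, the transform being asked for would turn the first copy loop into a count-down form roughly like this hand-written sketch; the exit test against zero is what makes the reversed IV cheap on most targets. (Legality requires that the loop has no carried dependence, which holds for this copy.)

    void
    copy_reversed (unsigned int N, double *a, double *c)
    {
      unsigned long i = 800;
      do
        {
          --i;
          c[i] = a[i];     /* accesses now step downwards: 799 ... 0 */
        }
      while (i != 0);      /* decrement-and-branch; the compare with
                              zero comes essentially for free */
    }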
Re: [RFC] type promotion pass
Hi Richard,

On 16 September 2017 at 06:12, Richard Biener wrote:
> On September 15, 2017 6:56:04 PM GMT+02:00, Jeff Law wrote:
>> On 09/15/2017 10:19 AM, Segher Boessenkool wrote:
>>> On Fri, Sep 15, 2017 at 09:18:23AM -0600, Jeff Law wrote:
>>>> WORD_REGISTER_OPERATIONS works with PROMOTE_MODE. The reason you can't
>>>> define WORD_REGISTER_OPERATIONS on aarch64 is because that the implicit
>>>> promotion is sometimes to 32 bits and sometimes to 64 bits.
>>>> WORD_REGISTER_OPERATIONS can't really describe that.
>>>
>>> WORD_REGISTER_OPERATIONS isn't well-defined.
>>>
>>> """
>>> @defmac WORD_REGISTER_OPERATIONS
>>> Define this macro to 1 if operations between registers with integral mode
>>> smaller than a word are always performed on the entire register.
>>> Most RISC machines have this property and most CISC machines do not.
>>> @end defmac
>>> """
>>>
>>> Exactly what operations? For almost all targets it isn't true for *all*
>>> operations. Or no targets even, if you include rotate, etc.
>>>
>>> For targets that have both 32-bit and 64-bit operations it is never true
>>> either.
>>>
>>>> And I'm also keen on doing something with type promotion -- Kai did some
>>>> work in this space years ago which I found interesting, even if the work
>>>> didn't go forward. It showed a real weakness. So I'm certainly
>>>> interested in looking at Prathamesh's work -- with the caveat that if it
>>>> stumbles across the same issues as Kai's work that it likely wouldn't be
>>>> acceptable in its current form.
>>>
>>> Doing type promotion too aggressively reduces code quality. "Just" find
>>> a sweet spot :-)
>>>
>>> Example: on Power, an AND of QImode with 0xc3 is just one insn, which
>>> actually does a SImode AND with 0xffffffc3. This is what we do currently.
>>> A SImode AND with 0x000000c3 is two insns, or one if we allow it to write
>>> to CR0 as well ("andi."); same for DImode, except there isn't a way to do
>>> an AND with 0xffffffffffffffc3 in one insn at all.
>>>
>>> unsigned char a;
>>> void f(void) { a &= 0xc3; };
>>
>> Yes, these are some of the things we kicked around. One of the most
>> interesting conclusions was that for these target issues we'd really
>> like a target.pd file to handle this class of transformations just prior
>> to rtl expansion.
>>
>> Essentially early type promotion/demotion would be concerned with cases
>> where we can eliminate operations in a target independent manner and
>> narrow operands as much as possible. Late promotion/demotion would deal
>> with stuff like the target's desire to work on specific sized hunks in
>> specific contexts.
>>
>> I'm greatly oversimplifying here. Type promotion/demotion is fairly
>> complex to get right.
>
> I always thought we should start with those promotions that are done by RTL
> expansion according to PROMOTE_MODE and friends. The complication is that
> those promotions also apply to function calls and arguments and those are
> difficult to break apart from other ABI specific details.
>
> IIRC the last time we went over this patch I concluded a better first step
> would be to expose call ABI details on GIMPLE much earlier. But I may
> misremember here.

I think this would be very useful. Some of the regressions in type promotion come from parameters/return values. The ABI in some cases guarantees that they are properly extended, but during type promotion we promote (or extend) anyway, leading to additional, redundant extensions. We might also need some way of having gimple statements that can convert (or promote) to the wider type without emitting extensions, just to keep the gimple type system happy.

Thanks,
Kugan

>
> Basically we couldn't really apply all promotions RTL expansion applies. One
> of my ideas with doing them early also was to simplify RTL expansion and
> especially promotion issues during SSA coalescing.
>
> Richard.
>
>> jeff
>
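A small example of the parameter/return-value issue described above (illustrative; whether the incoming value is really extended is an ABI-specific assumption, not something GIMPLE can currently see):

    /* Assume a target whose ABI guarantees that an unsigned char
       argument arrives zero-extended in its register.  Then 'c' needs
       no masking here; but if the promotion pass widens 'c' to int
       with an explicit zero-extension, a redundant 'and'/'uxtb'-style
       instruction survives unless the ABI guarantee is exposed in
       GIMPLE.  */
    unsigned char
    inc (unsigned char c)
    {
      return c + 1;   /* the result must again satisfy the return-value ABI */
    }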
Re: [RFC] type promotion pass
Hi Steve,

On 19 September 2017 at 05:45, Steve Ellcey wrote:
> On Mon, 2017-09-18 at 23:29 +0530, Prathamesh Kulkarni wrote:
>>
>> Hi Steve,
>> The patch is currently based on r249469. I will rebase it on ToT and
>> look into the build failure.
>> Thanks for pointing it out.
>>
>> Regards,
>> Prathamesh
>
> OK, I applied it to that version successfully. The thing I wanted to
> check was to see if this helped with PR target/77729. It does not,
> so I think even with this patch we would need my patch to address the
> issue of having GCC recognize that ldrb/ldrh zero out the top of a
> register and thus we do not need to mask it out later.
>
> https://gcc.gnu.org/ml/gcc-patches/2017-09/msg00929.html

I tried the testcases you have in the patch with type promotion. It looks like forwprop is reversing the promotion there. I haven't looked in detail yet, but -fno-tree-forwprop seems to remove six "and" instructions from the test case. I have a slightly different version from what Prathamesh has posted and hope that there isn't any difference here.

Thanks,
Kugan
Handling prefetcher tag collisions while allocating registers
Hi All,

I am wondering if there is any way we can prefer certain registers in register allocation. That is, I want to have some way of recording register allocation decisions (for loads in a loop that are accessed in steps) and use this to influence the register allocation of other loads (again, loads that are accessed in steps).

This is for architectures (like Falkor AArch64) whose hardware prefetchers use signatures of the loads to lock onto and tune prefetching parameters. Ideally, if the loads are from the same stream, they should have the same signature, and if they are from different streams, they should have different signatures. Destination, base register and offset are used in the signature; therefore, selecting a different register can influence this.

In LLVM, this is implemented as a machine-specific pass that runs after register allocation. It then inserts mov instructions with scratch registers to manage this. We could do a machine reorg pass in gcc, but detecting strided loads at that stage is not easy.

I am trying to implement this in gcc and wondering what is the preferred and acceptable way to implement this. Any thoughts?

Thanks,
Kugan
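To illustrate what "same stream, same signature" means, consider the sketch below. The exact hash Falkor uses is simplified away here; the point is only that the allocator's choice of destination register feeds the prefetcher's tag:

    /* Two independent strided streams.  A tag-based prefetcher hashes
       (destination reg, base reg, offset) per load; if the allocator
       happens to give the A-stream and B-stream loads colliding tag
       bits, the prefetcher sees one interleaved stream and mistrains.
       Steering one load to a different destination register breaks
       the collision.  */
    void
    sum (const long *a, const long *b, long *out, int n)
    {
      for (int i = 0; i < n; i++)
        out[i] = a[i] + b[i];   /* one load per stream per iteration */
    }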
Re: Handling prefetcher tag collisions while allocating registers
Hi Bin,

On 24 October 2017 at 18:29, Bin.Cheng wrote:
> On Tue, Oct 24, 2017 at 12:44 AM, Kugan Vivekanandarajah
> wrote:
>> Hi All,
>>
>> I am wondering if there is any way we can prefer certain registers in
>> register allocation. That is, I want to have some way of recording
>> register allocation decisions (for loads in loop that are accessed in
>> steps) and use this to influence register allocation of other loads
>> (again that are accessed in steps).
>>
>> This is for architectures (like falkor AArch64) that use hardware
>> perefetchers that use signatures of the loads to lock into and tune
>> prefetching parameters. Ideally, If the loads are from the same
>> stream, they should have same signature and if they are from different
>> stream, they should have different signature. Destination, base
>> register and offset are used in the signature. Therefore, selecting
>> different register can influence this.
> I wonder why the destination register is used in signature. In an extreme
> case, load in loop can be unrolled then allocated to different dest
> registers. Forcing the same dest register could be too restricted.

My description is very simplified. The signature is based on part of the register number; thus, two registers can have the same signature. What we don't want is collisions between loads from two different memory streams. So this is not an issue.

Thanks,
Kugan

>
> Thanks,
> bin
>
>>
>> In LLVM, this is implemented as a machine specific pass that runs
>> after register allocation. It then inserts mov instruction with
>> scratch registers to manage this. We can do a machine reorg pass in
>> gcc but detecting strided loads at that stage is not easy.
>>
>> I am trying to implement this in gcc and wondering what is the
>> preferred and acceptable way to implement this. Any thoughts ?
>>
>> Thanks,
>> Kugan
Re: Global analysis of RTL
Hi,

On 26 October 2017 at 14:13, R0b0t1 wrote:
> On Thu, Oct 19, 2017 at 8:46 AM, Geoff Wozniak wrote:
>> R0b0t1 writes:
>>>
>>> When I first looked at the GCC codebase, it seemed to me that most
>>> operations should be done on the GIMPLE representation as it contains the
>>> most information. Is there any reason you gravitated towards RTL?
>>
>>
>> Naiveté, really.
>>
>> My team and I didn’t know much about the code base when we started looking
>> at the problem, although we knew a little about the intermediate formats.
>> GIMPLE makes the analysis more complicated, although not impossible, and it
>> can make the cost model difficult to pin down. Raw assembly/machine code is
>> ideal, but then we have to deal with different platforms and would likely
>> have to do all the work in the linker. RTL is sufficiently low-level enough
>> (as far as we know) to start counting instructions, and platform independent
>> enough that we don’t have to parse machine code.
>>
>> Essentially, working with RTL makes the implementation a little easier but
>> we didn’t know that the pass infrastructure wasn’t in our favour.
>>
>> It’s likely we’ll turn our attention to GIMPLE and assembler/machine code,
>> unless we can come up with something (or anyone has a suggestion).
>>
>
> Admittedly I do not know much about compiler design, but your response
> has put some of what I read about analysis of RTL into context. It
> is hard to be sure, but I think analysis of RTL has fallen out of
> favor and has been replaced with the analysis of intermediate
> languages. For example, compare clang and llvm's operation.

It is not really being replaced (at least I am not aware of it). It is true that more and more of the high-level optimisations are moved to gimple. When we move from a high-level intermediate format to a lower-level one, we tend to lose some information and get closer to the machine representation. An obvious example is that signedness is not represented in RTL. Moreover, in RTL after reload, we have a one-to-one mapping from RTL to actual machine instructions (i.e. it is even closer to asm). In short, GCC goes from GENERIC to GIMPLE to RTL as statements are lowered from high-level languages towards asm.

Thanks,
Kugan

>
> The missing link is that you seem to be right about cost calculation.
> Cost calculation is difficult for high level operations. Would online
> analysis of the produced machine code be sufficient? That seems to be
> a popular solution from what I have read.
>
> Thanks for the response, and best of luck to you.
>
> Cheers,
> R0b0t1.
Re: Problems in IPA passes
Hi Jeff,

On 28 October 2017 at 18:28, Jeff Law wrote:
>
> Jan,
>
> What's the purpose behind calling vrp_meet and
> extract_range_from_unary_expr from within the IPA passes?

This is used when we have an argument to a function for which we know the VR, and this in turn is passed as a parameter to another function. For example:

void foo (int i)
{
  ...
  bar (unary_op (i))
  ...
}

This is mainly to share what is done in tree-vrp.

>
> AFAICT that is not safe to do. Various paths through those routines
> will access static objects within tree-vrp.c which may not be
> initialized when IPA runs (vrp_equiv_obstack, vr_value).

IPA-VRP does not track equivalences, and vr_value is not used.

Thanks,
Kugan

>
> While this seems to be working today, it's a failure waiting to happen.
>
> Is there any way you can avoid using those routines? I can't believe
> you really need all the complexity of those routines, particularly
> extract_range_from_unary_expr. Plus it's just downright fugly from a
> modularity standpoint.
>
> ?
>
> Jeff
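Slightly expanded, with a hypothetical range for illustration:

    void bar (int j);

    void
    foo (int i)        /* suppose IPA knows the VR of i: [1, 100] */
    {
      bar (-i);        /* extract_range_from_unary_expr with NEGATE_EXPR
                          gives the VR propagated to bar: [-100, -1] */
    }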
Re: Problems in IPA passes
Hi Jeff,

On 31 October 2017 at 14:47, Jeff Law wrote:
> On 10/29/2017 03:54 PM, Kugan Vivekanandarajah wrote:
>> Hi Jeff,
>>
>> On 28 October 2017 at 18:28, Jeff Law wrote:
>>>
>>> Jan,
>>>
>>> What's the purpose behind calling vrp_meet and
>>> extract_range_from_unary_expr from within the IPA passes?
>>
>> This is used such that when we have an argument to a function and this
>> for which we know the VR and this intern is passed as a parameter to
>> another. For example:
>>
>> void foo (int i)
>> {
>> ...
>> bar (unary_op (i))
>> ...
>> }
>>
>> This is mainly to share what is done in tree-vrp.
> Presumably you never have equivalences or anything like that, which
> probably helps with not touching vrp_bitmap_obstack which isn't
> initialized when you run the IPA bits.
>
>>>
>>> AFAICT that is not safe to do. Various paths through those routines
>>> will access static objects within tree-vrp.c which may not be
>>> initialized when IPA runs (vrp_equiv_obstack, vr_value).
>>
>> IPA-VRP does not track equivalence and vr_value is not used.
> But there's no enforcement and I'd be hard pressed to believe that all
> the paths through the routines you use in tree-vrp aren't going to touch
> vr_value, or vrp_bitmap_obstack. vrp_bitmap_obstack turns out to be
> incredibly tangled into the implementations within tree-vrp.c :(
>

I looked into the usage, and it does not seem to be using vr_value unless I am missing something. There are two overloaded functions here:

extract_range_from_unary_expr (value_range *vr, enum tree_code code, tree type, value_range *vr0_, tree op0_type)

is safe, as it works with value_range arguments and does not use get_value_range to access vr_value.

extract_range_from_unary_expr (value_range *vr, enum tree_code code, tree type, tree op0)

This one is not safe, as it takes a tree as an argument and gets the value_range by calling get_value_range.

Maybe we should change the names to reflect this.

Thanks,
Kugan
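Put as declarations, the distinction reads like this (signatures paraphrased from the description above, not copied from tree-vrp.c):

    /* Safe for IPA use: operates purely on the value_range operands
       handed to it, so it never consults the pass-local vr_value.  */
    void extract_range_from_unary_expr (value_range *vr, enum tree_code code,
                                        tree type, value_range *vr0_,
                                        tree op0_type);

    /* Not safe for IPA use: takes the operand as a tree and internally
       calls get_value_range, which reads vr_value (only set up while
       the tree-vrp pass itself is running).  */
    void extract_range_from_unary_expr (value_range *vr, enum tree_code code,
                                        tree type, tree op0);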
Re: poly_uint64 / TYPE_VECTOR_SUBPARTS question
Hi,

On 9 February 2018 at 09:08, Steve Ellcey wrote:
> I have a question about the poly_uint64 type and the TYPE_VECTOR_SUBPARTS
> macro. I am trying to copy some code from i386.c into my aarch64 build
> that is basically:
>
> int n;
> n = TYPE_VECTOR_SUBPARTS (type_out);
>
> And it is not compiling for me, I get:
>
> /home/sellcey/gcc-vectmath/src/gcc/gcc/config/aarch64/aarch64-builtins.c:1504:37:
> error: cannot convert ‘poly_uint64’ {aka ‘poly_int<2, long unsigned int>’}
> to ‘int’ in assignment
>    n = TYPE_VECTOR_SUBPARTS (type_out);

AFAIK, you could use to_constant () if the value is known to be a compile-time constant; see the sketch below.

Thanks,
Kugan

>
> My first thought was that I was missing a header file but I put
> all the header includes that are in i386.c into aarch64-builtins.c
> and it still does not compile. It works on the i386 side. It looks
> like poly-int.h and poly-int-types.h are included by coretypes.h
> and I include that header file so I don't understand why this isn't
> compiling and what I am missing. Any help?
>
> Steve Ellcey
> sell...@cavium.com
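A minimal sketch of both forms (to_constant and is_constant are the poly_int accessors in poly-int.h):

    /* Force the count to a constant; asserts if it is not one.  Fine
       on fixed-length vector targets.  */
    static int
    subparts_as_int (tree type_out)
    {
      return (int) TYPE_VECTOR_SUBPARTS (type_out).to_constant ();
    }

    /* Degrade gracefully when the count may be variable (e.g. SVE).  */
    static bool
    subparts_if_constant (tree type_out, unsigned HOST_WIDE_INT *n)
    {
      return TYPE_VECTOR_SUBPARTS (type_out).is_constant (n);
    }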
Generating gimple assign stmt that changes sign
Hi,

I am looking to introduce ABSU_EXPR, which would create:

unsigned short res = ABSU_EXPR (short);

Note that the argument is signed and the result is unsigned. As per the review, I have a match.pd entry to generate this as:

(simplify
 (abs (convert @0))
 (if (ANY_INTEGRAL_TYPE_P (TREE_TYPE (@0)))
  (convert (absu @0))))

Now when gimplifying the converted tree, how do we tell that ABSU_EXPR takes a signed argument and returns unsigned? I will have other match.pd entries, so this will also be generated while in gimple passes. Should I add new functions in gimple.[h|c] for this?

Are there any examples I can refer to? Conversion expressions seem to be the only place where sign can change in a gimple assignment, but they are very specific.

Thanks,
Kugan
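A worked example of why the result type must be unsigned (ABSU_EXPR is the proposed tree code, shown here in dump-style pseudocode):

    /* With 16-bit short, for x == -32768 (SHRT_MIN):

         ABS_EXPR  <x> in short:           32768 is not representable,
                                           so the operation overflows;
         ABSU_EXPR <x> in unsigned short:  32768, representable.

       So absu is well-defined for every input value, which is what
       lets match.pd fold abs ((int) x) into (int) absu (x).  */
    short x = -32768;
    unsigned short r;   /* r = ABSU_EXPR <x>;  ==> r == 32768 */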
Re: Generating gimple assign stmt that changes sign
Hi Jeff, Thanks for the prompt reply. On 22 May 2018 at 09:10, Jeff Law wrote: > On 05/21/2018 04:50 PM, Kugan Vivekanandarajah wrote: >> Hi, >> >> I am looking to introduce ABSU_EXPR and that would create: >> >> unsigned short res = ABSU_EXPR (short); >> >> Note that the argument is signed and result is unsigned. As per the >> review, I have a match.pd entry to generate this as: >> (simplify (abs (convert @0)) >> (if (ANY_INTEGRAL_TYPE_P (TREE_TYPE (@0))) >> (convert (absu @0 >> >> >> Now when gimplifying the converted tree, how do we tell that ABSU_EXPR >> will take a signed arg and return unsigned. I will have other match.pd >> entries so this will be generated while in gimple.passes too. Should I >> add new functions in gimple.[h|c] for this. >> >> Is there any examples I can refer to. Conversion expressions seems to >> be the only place where sign can change in gimple assignment but they >> are very specific. > What's the value in representing ABSU vs a standard ABS followed by a > conversion? It is based on PR https://gcc.gnu.org/bugzilla/show_bug.cgi?id=64946. Specifically, comment 13. > > You'll certainly want to do verification of the type signedness in the > gimple verifier. I am doing it and it is failing now. > > In general the source and destination types have to be the same. > Conversions are the obvious exception. There's a few other nodes that > have more complex type rules (MEM_REF, COND_EXPR and a few others). But > I don't think they're necessarily going to be helpful. Thanks, Kugan > > jeff
Sched1 stability issue
Hi,

We noticed a difference in the code generated for aarch64 by gcc 7.2 hosted on Linux vs mingw. AFAIK, we are supposed to produce the same output. For the testcase we have (quite large, and I am trying to reduce it), the difference comes from the sched1 pass; if I disable sched1, the difference goes away. Is this a known issue?

Attached is the sched1 dump snippet where the two hosts differ.

Thanks,
Kugan

verify found no changes in insn with uid = 41.
starting the processing of deferred insns
ending the processing of deferred insns
df_analyze called
Pass 0 for finding pseudo/allocno costs
r84 costs: CALLER_SAVE_REGS:0 GENERAL_REGS:0 FP_LO_REGS:2 FP_REGS:2 ALL_REGS:2 MEM:8000
r83 costs: CALLER_SAVE_REGS:0 GENERAL_REGS:0 FP_LO_REGS:2 FP_REGS:2 ALL_REGS:2 MEM:8000
r80 costs: CALLER_SAVE_REGS:0 GENERAL_REGS:0 FP_LO_REGS:1 FP_REGS:1 ALL_REGS:1 MEM:8000
r79 costs: CALLER_SAVE_REGS:0 GENERAL_REGS:0 FP_LO_REGS:4000 FP_REGS:4000 ALL_REGS:1 MEM:8000
r78 costs: CALLER_SAVE_REGS:0 GENERAL_REGS:0 FP_LO_REGS:4000 FP_REGS:4000 ALL_REGS:1 MEM:8000
r77 costs: CALLER_SAVE_REGS:0 GENERAL_REGS:0 FP_LO_REGS:9000 FP_REGS:9000 ALL_REGS:1 MEM:8000
Pass 1 for finding pseudo/allocno costs
r86: preferred GENERAL_REGS, alternative NO_REGS, allocno GENERAL_REGS
r85: preferred GENERAL_REGS, alternative NO_REGS, allocno GENERAL_REGS
r84: preferred GENERAL_REGS, alternative NO_REGS, allocno GENERAL_REGS
r83: preferred GENERAL_REGS, alternative NO_REGS, allocno GENERAL_REGS
r82: preferred GENERAL_REGS, alternative NO_REGS, allocno GENERAL_REGS
r81: preferred GENERAL_REGS, alternative NO_REGS, allocno GENERAL_REGS
r80: preferred GENERAL_REGS, alternative NO_REGS, allocno GENERAL_REGS
r79: preferred GENERAL_REGS, alternative NO_REGS, allocno GENERAL_REGS
r78: preferred GENERAL_REGS, alternative NO_REGS, allocno GENERAL_REGS
r77: preferred GENERAL_REGS, alternative NO_REGS, allocno GENERAL_REGS
r76: preferred GENERAL_REGS, alternative NO_REGS, allocno GENERAL_REGS
r75: preferred GENERAL_REGS, alternative NO_REGS, allocno GENERAL_REGS
r74: preferred GENERAL_REGS, alternative NO_REGS, allocno GENERAL_REGS
r73: preferred GENERAL_REGS, alternative NO_REGS, allocno GENERAL_REGS
r72: preferred GENERAL_REGS, alternative NO_REGS, allocno GENERAL_REGS
r71: preferred GENERAL_REGS, alternative NO_REGS, allocno GENERAL_REGS
r70: preferred GENERAL_REGS, alternative NO_REGS, allocno GENERAL_REGS
r69: preferred GENERAL_REGS, alternative NO_REGS, allocno GENERAL_REGS
r68: preferred GENERAL_REGS, alternative NO_REGS, allocno GENERAL_REGS
r67: preferred GENERAL_REGS, alternative NO_REGS, allocno GENERAL_REGS
r84 costs: GENERAL_REGS:0 FP_LO_REGS:2 FP_REGS:2 ALL_REGS:2 MEM:8000
r83 costs: GENERAL_REGS:0 FP_LO_REGS:2 FP_REGS:2 ALL_REGS:2 MEM:8000
r80 costs: GENERAL_REGS:0 FP_LO_REGS:1 FP_REGS:1 ALL_REGS:1 MEM:8000
r79 costs: GENERAL_REGS:0 FP_LO_REGS:1 FP_REGS:1 ALL_REGS:1 MEM:8000
r78 costs: GENERAL_REGS:0 FP_LO_REGS:1 FP_REGS:1 ALL_REGS:1 MEM:8000
r77 costs: GENERAL_REGS:0 FP_LO_REGS:1 FP_REGS:1 ALL_REGS:1 MEM:8000
;; ==
;; -- basic block 2 from 3 to 48 -- before reload
;; ==
;; 0--> b 0: i 24 r77=ap-0x40 :cortex_a53_slot_any:GENERAL_REGS+1(1)FP_REGS+0(0)
;; 0--> b 0: i 26 r78=0xffc8 :cortex_a53_slot_any:GENERAL_REGS+1(1)FP_REGS+0(0)
;; 1--> b 0: i 25 [sfp-0x10]=r77 :(cortex_a53_slot_any+cortex_a53_ls_agen),cortex_a53_store:@GENERAL_REGS+0(-1)@FP_REGS+0(0) --
-;; 1--> b 0: i 9 [ap-0x8]=x7 :(cortex_a53_slot_any+cortex_a53_ls_agen),cortex_a53_store:@GENERAL_REGS+0(-1)@FP_REGS+0(0) --
-;; 2--> b 0: i 22 [sfp-0x20]=ap :(cortex_a53_slot_any+cortex_a53_ls_agen),cortex_a53_store:GENERAL_REGS+0(0)FP_REGS+0(0)
+;; 1--> b 0: i 22 [sfp-0x20]=ap :(cortex_a53_slot_any+cortex_a53_ls_agen),cortex_a53_store:@GENERAL_REGS+0(0)@FP_REGS+0(0)
;; 2--> b 0: i 23 [sfp-0x18]=ap :(cortex_a53_slot_any+cortex_a53_ls_agen),cortex_a53_store:GENERAL_REGS+0(0)FP_REGS+0(0)
-;; 3--> b 0: i 27 [sfp-0x8]=r78 :(cortex_a53_slot_any+cortex_a53_ls_agen),cortex_a53_store:GENERAL_REGS+0(-1)FP_REGS+0(0)
+;; 2--> b 0: i 27 [sfp-0x8]=r78 :(cortex_a53_slot_any+cortex_a53_ls_agen),cortex_a53_store:GENERAL_REGS+0(-1)FP_REGS+0(0)
;; 3--> b 0: i 28 r79=0xff80 :cortex_a53_slot_any:GENERAL_REGS+1(1)FP_REGS+0(0)
-;; 4--> b 0: i 10 [ap-0xc0]=v0 :(cortex_a53_slot_any+cortex_a53_ls_agen),cortex_a53_store:@GENERAL_REGS+0(0)@FP_REGS+0(-1)
+;; 3--> b 0: i 10 [ap-0xc0]=v0 :(cortex_a53_slot_any+cortex_a53_ls_agen),cortex_a53_store:@GENERAL_REGS+0(0)@FP_REGS+0(-1)
;;