Re: RFC: LRA for x86/x86-64 [0/9]
On Fri, Sep 28, 2012 at 12:56 AM, Vladimir Makarov wrote: > Originally I was to submit LRA at the very beginning of stage1 for > gcc4.9 as it was discussed on this summer GNU Tools Cauldron. After > some thinking, I've decided to submit LRA now but only switched on for > *x86/x86-64* target. The reasons for that are > o I am already pretty confident in LRA for this target with the > point of reliability, performance, code size, and compiler speed. > o I am confident that I can fix LRA bugs and pitfalls which might be > recognized and reported during stage2 and 3 of gcc4.8. > o Wider LRA testing for x86/x86-64 will make smoother a hard transition of > other targets to LRA during gcc4.9 development. > > During development of gcc4.9, I'd like to switch major targets to > LRA as it was planned before. I hope that all targets will be > switched for the next release after gcc4.9 (although it will be > dependent mostly on the target maintainers). When/if it is done, > reload and reload oriented machine-dependent code can be removed. > > LRA project was reported on 2012 GNU Tools Cauldron > (http://gcc.gnu.org/wiki/cauldron2012). The presentation contains a > high-level description of LRA and the project status. > > The following patches makes LRA working for x86/x86-64. Separately > patches mostly do nothing until the last patch switches on LRA for > x86/x86-64. Although compiler is bootstrapped after applying each > patch in given order, the division is only for review convenience. > > Any comments and proposals are appreciated. Even if GCC community > decides that it is too late to submit it to gcc4.8, the earlier reviews > are always useful. >From a release-manager point of view the patch is "in time" for 4.8, in that it is during stage1 (which I expect to last another two to four weeks). Note that there is no such thing as "stage2" anymore but we go straight to "stage3" (bugfixing mode, no new features) from stage1. After three months of stage3 we go into regression-fixes only mode for as long as there are release-blocking bugs (regressions with priority P1). You will have roughly half a year to fix LRA for 4.8.0 after stage1 closes. Thanks, Richard. > The patches were successfully bootstrapped and tested for x86/x86-64. >
Re: vec_cond_expr adjustments
On Fri, Sep 28, 2012 at 12:42 AM, Marc Glisse wrote: > Hello, > > I have been experimenting with generating VEC_COND_EXPR from the front-end, > and these are just a couple things I noticed. > > 1) optabs.c requires that the first argument of vec_cond_expr be a > comparison, but verify_gimple_assign_ternary only checks is_gimple_condexpr, > like for COND_EXPR. In the long term, it seems better to also allow ssa_name > and vector_cst (thus match the is_gimple_condexpr condition), but for now I > just want to know early if I created an invalid vec_cond_expr. optabs should be fixed instead, an is_gimple_val condition is implicitely val != 0. > 2) a little refactoring of the code to find a suitable vector type for > comparison results, and one more place where it should be used (no testcase > yet because I don't know if that path can be taken without front-end changes > first). Yes, it looks fine to me. > I did wonder, for tree-ssa-forwprop, about using directly TREE_TYPE > (cond) without truth_type_for. Yes, that should work. > Hmm, now I am wondering whether I should have waited until I had front-end > vec_cond_expr support to submit everything at once... ;) The tree.[ch] and gimple-fold.c hunks are ok if tested properly, the tree-ssa-forwprop.c idea of using TREE_TYPE (cond), too. I don't like the tree-cfg.c change, instead re-factor optabs.c to get a decomposed cond for vector_compare_rtx and appropriately "decompose" a non-comparison-class cond in expand_vec_cond_expr. If we for example have predicate = a < b; x = predicate ? d : e; y = predicate ? f : g; we ideally want to re-use the predicate computation on targets where that would be optimal (and combine should be able to recover the case where it is not). Thanks, Richard. > 2012-09-27 Marc Glisse > > * tree-cfg.c (verify_gimple_assign_ternary): Stricter check on > first argument of VEC_COND_EXPR. > * tree.c (truth_type_for): New function. > * tree.h (truth_type_for): Declare. > * gimple-fold.c (and_comparisons_1): Call it. > (or_comparisons_1): Likewise. > * tree-ssa-forwprop.c (forward_propagate_into_cond): Likewise. > > -- > Marc Glisse > Index: gcc/tree-ssa-forwprop.c > === > --- gcc/tree-ssa-forwprop.c (revision 191810) > +++ gcc/tree-ssa-forwprop.c (working copy) > @@ -549,21 +549,22 @@ static bool > forward_propagate_into_cond (gimple_stmt_iterator *gsi_p) > { >gimple stmt = gsi_stmt (*gsi_p); >tree tmp = NULL_TREE; >tree cond = gimple_assign_rhs1 (stmt); >bool swap = false; > >/* We can do tree combining on SSA_NAME and comparison expressions. */ >if (COMPARISON_CLASS_P (cond)) > tmp = forward_propagate_into_comparison_1 (stmt, TREE_CODE (cond), > - boolean_type_node, > + truth_type_for > +(TREE_TYPE (cond)), >TREE_OPERAND (cond, 0), >TREE_OPERAND (cond, 1)); >else if (TREE_CODE (cond) == SSA_NAME) > { >enum tree_code code; >tree name = cond; >gimple def_stmt = get_prop_source_stmt (name, true, NULL); >if (!def_stmt || !can_propagate_from (def_stmt)) > return 0; > > Index: gcc/tree-cfg.c > === > --- gcc/tree-cfg.c (revision 191810) > +++ gcc/tree-cfg.c (working copy) > @@ -3758,22 +3758,24 @@ verify_gimple_assign_ternary (gimple stm >tree rhs2_type = TREE_TYPE (rhs2); >tree rhs3 = gimple_assign_rhs3 (stmt); >tree rhs3_type = TREE_TYPE (rhs3); > >if (!is_gimple_reg (lhs)) > { >error ("non-register as LHS of ternary operation"); >return true; > } > > - if (((rhs_code == VEC_COND_EXPR || rhs_code == COND_EXPR) > - ? !is_gimple_condexpr (rhs1) : !is_gimple_val (rhs1)) > + if (((rhs_code == COND_EXPR) ? 
!is_gimple_condexpr (rhs1) > + : (rhs_code == VEC_COND_EXPR) ? (!is_gimple_condexpr (rhs1) > + || is_gimple_val (rhs1)) > + : !is_gimple_val (rhs1)) >|| !is_gimple_val (rhs2) >|| !is_gimple_val (rhs3)) > { >error ("invalid operands in ternary operation"); >return true; > } > >/* First handle operations that involve different types. */ >switch (rhs_code) > { > Index: gcc/gimple-fold.c > === > --- gcc/gimple-fold.c (revision 191810) > +++ gcc/gimple-fold.c (working copy) > @@ -23,21 +23,20 @@ along with GCC; see the file COPYING3. > #include "coretypes.h" > #include "tm.h" > #include "tree.h" > #include "flags.h" > #include "function.h" > #include "dumpfile.h" > #include "tree-flow.h" > #include "t
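A minimal sketch of the decomposition suggested above, assuming vector_compare_rtx is changed to take the comparison code and operands separately; the variable names and exact signature are illustrative, not the committed change:

  /* In expand_vec_cond_expr: split the condition into (code, op0a, op0b)
     before calling the comparison expander.  */
  if (COMPARISON_CLASS_P (op0))
    {
      /* An embedded comparison: use it directly.  */
      code = TREE_CODE (op0);
      op0a = TREE_OPERAND (op0, 0);
      op0b = TREE_OPERAND (op0, 1);
    }
  else
    {
      /* A bare SSA_NAME or constant mask: treat it as mask != 0.  */
      code = NE_EXPR;
      op0a = op0;
      op0b = build_zero_cst (TREE_TYPE (op0));
    }
  comparison = vector_compare_rtx (code, op0a, op0b, unsignedp, icode);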
[PATCH][LTO] "properly" stream PARM_DECLs
This avoids streaming PARM_DECLs both in the global type/decl and the local function sections. They are needed on the global level, so properly stream them there. Fixes part of the issues we have with debug information for early inlining, too. LTO Bootstrapped and tested on x86_64-unknown-linux-gnu, applied. Richard. 2012-09-28 Richard Guenther PR lto/47799 * lto-streamer-out.c (tree_is_indexable): Make PARM_DECLs global. (lto_output_tree_ref): Handle references to them. (output_function): Do not output function arguments again. * lto-streamer-in.c (input_function): Do not input arguments again, nor overwrite them. Index: gcc/lto-streamer-out.c === *** gcc/lto-streamer-out.c.orig 2012-09-26 16:47:18.0 +0200 --- gcc/lto-streamer-out.c 2012-09-28 11:18:35.438055184 +0200 *** static bool *** 125,131 tree_is_indexable (tree t) { if (TREE_CODE (t) == PARM_DECL) ! return false; else if (TREE_CODE (t) == VAR_DECL && decl_function_context (t) && !TREE_STATIC (t)) return false; --- 125,131 tree_is_indexable (tree t) { if (TREE_CODE (t) == PARM_DECL) ! return true; else if (TREE_CODE (t) == VAR_DECL && decl_function_context (t) && !TREE_STATIC (t)) return false; *** lto_output_tree_ref (struct output_block *** 237,242 --- 237,243 case VAR_DECL: case DEBUG_EXPR_DECL: gcc_assert (decl_function_context (expr) == NULL || TREE_STATIC (expr)); + case PARM_DECL: streamer_write_record_start (ob, LTO_global_decl_ref); lto_output_var_decl_index (ob->decl_state, ob->main_stream, expr); break; *** output_function (struct cgraph_node *nod *** 806,814 output_struct_function_base (ob, fn); - /* Output the head of the arguments list. */ - stream_write_tree (ob, DECL_ARGUMENTS (function), true); - /* Output all the SSA names used in the function. */ output_ssa_names (ob, fn); --- 807,812 Index: gcc/lto-streamer-in.c === *** gcc/lto-streamer-in.c.orig 2012-09-21 10:59:45.0 +0200 --- gcc/lto-streamer-in.c 2012-09-28 11:14:53.835068419 +0200 *** input_function (tree fn_decl, struct dat *** 823,829 gimple *stmts; basic_block bb; struct cgraph_node *node; - tree args, narg, oarg; fn = DECL_STRUCT_FUNCTION (fn_decl); tag = streamer_read_record_start (ib); --- 823,828 *** input_function (tree fn_decl, struct dat *** 834,855 input_struct_function_base (fn, data_in, ib); - /* Read all function arguments. We need to re-map them here to the - arguments of the merged function declaration. */ - args = stream_read_tree (ib, data_in); - for (oarg = args, narg = DECL_ARGUMENTS (fn_decl); -oarg && narg; -oarg = TREE_CHAIN (oarg), narg = TREE_CHAIN (narg)) - { - unsigned ix; - bool res; - res = streamer_tree_cache_lookup (data_in->reader_cache, oarg, &ix); - gcc_assert (res); - /* Replace the argument in the streamer cache. */ - streamer_tree_cache_insert_at (data_in->reader_cache, narg, ix); - } - gcc_assert (!oarg && !narg); - /* Read all the SSA names. */ input_ssa_names (ib, data_in, fn); --- 833,838
Re: [PATCH] Fix up vector CONSTRUCTOR handling (PR tree-optimization/54713)
On Fri, Sep 28, 2012 at 1:31 PM, Jakub Jelinek wrote: > Hi! > > As discussed in the PR, tree-vect-generic.c sometimes creates CONSTRUCTORs > with vector elements to build up larger vectors from vectors supported by > HW. This patch teaches fold-const to bail up on those. Vector CONSTRUCTOR > verification changes and assorted fixes have been separated from the patch > and will be posted probably next week. > > Bootstrapped/regtested on x86_64-linux and i686-linux, ok for trunk? Ok. Thanks, Richard. > 2012-09-28 Jakub Jelinek > > PR tree-optimization/54713 > * fold-const.c (vec_cst_ctor_to_array): Give up if vector CONSTRUCTOR > has vector elements. > (fold_ternary_loc) : Likewise. > * tree-vect-generic.c (vector_element): Don't rely on CONSTRUCTOR elts > indexes. Use BIT_FIELD_REF if CONSTRUCTOR has vector elements. > (lower_vec_perm): Use NULL_TREE CONSTRUCTOR indexes. > > * gcc.c-torture/compile/pr54713-1.c: New test. > * gcc.c-torture/compile/pr54713-2.c: New test. > * gcc.c-torture/compile/pr54713-3.c: New test. > > --- gcc/fold-const.c.jj 2012-09-25 11:59:43.0 +0200 > +++ gcc/fold-const.c2012-09-26 13:14:05.639000395 +0200 > @@ -9559,7 +9559,7 @@ vec_cst_ctor_to_array (tree arg, tree *e >constructor_elt *elt; > >FOR_EACH_VEC_ELT (constructor_elt, CONSTRUCTOR_ELTS (arg), i, elt) > - if (i >= nelts) > + if (i >= nelts || TREE_CODE (TREE_TYPE (elt->value)) == VECTOR_TYPE) > return false; > else > elts[i] = elt->value; > @@ -14030,22 +14030,35 @@ fold_ternary_loc (location_t loc, enum t > unsigned i; > if (CONSTRUCTOR_NELTS (arg0) == 0) > return build_constructor (type, NULL); > - vals = VEC_alloc (constructor_elt, gc, n); > - for (i = 0; i < n && idx + i < CONSTRUCTOR_NELTS (arg0); > - ++i) > - CONSTRUCTOR_APPEND_ELT (vals, NULL_TREE, > - CONSTRUCTOR_ELT > - (arg0, idx + i)->value); > - return build_constructor (type, vals); > + if (TREE_CODE (TREE_TYPE (CONSTRUCTOR_ELT (arg0, > +0)->value)) > + != VECTOR_TYPE) > + { > + vals = VEC_alloc (constructor_elt, gc, n); > + for (i = 0; > + i < n && idx + i < CONSTRUCTOR_NELTS (arg0); > + ++i) > + CONSTRUCTOR_APPEND_ELT (vals, NULL_TREE, > + CONSTRUCTOR_ELT > + (arg0, idx + i)->value); > + return build_constructor (type, vals); > + } > } > } > else if (n == 1) > { > if (TREE_CODE (arg0) == VECTOR_CST) > return VECTOR_CST_ELT (arg0, idx); > - else if (idx < CONSTRUCTOR_NELTS (arg0)) > - return CONSTRUCTOR_ELT (arg0, idx)->value; > - return build_zero_cst (type); > + else if (CONSTRUCTOR_NELTS (arg0) == 0) > + return build_zero_cst (type); > + else if (TREE_CODE (TREE_TYPE (CONSTRUCTOR_ELT (arg0, > + 0)->value)) > + != VECTOR_TYPE) > + { > + if (idx < CONSTRUCTOR_NELTS (arg0)) > + return CONSTRUCTOR_ELT (arg0, idx)->value; > + return build_zero_cst (type); > + } > } > } > } > --- gcc/tree-vect-generic.c.jj 2012-09-18 12:14:48.0 +0200 > +++ gcc/tree-vect-generic.c 2012-09-26 14:23:40.742171292 +0200 > @@ -1050,14 +1050,13 @@ vector_element (gimple_stmt_iterator *gs > >if (TREE_CODE (vect) == VECTOR_CST) > return VECTOR_CST_ELT (vect, index); > - else if (TREE_CODE (vect) == CONSTRUCTOR) > + else if (TREE_CODE (vect) == CONSTRUCTOR > + && (CONSTRUCTOR_NELTS (vect) == 0 > + || TREE_CODE (TREE_TYPE (CONSTRUCTOR_ELT (vect, 0)->value)) > + != VECTOR_TYPE)) > { > - unsigned i; > - tree elt_i, elt_v; > - > - FOR_EACH_CONSTRUCTOR_ELT (CONSTRUCTOR_ELTS (vect), i, elt_i, elt_v) > -if (operand_equal_p (elt_i, idx, 0)) > - return elt_v; > + if (index < CONSTRUCTOR_NELTS (vect)) > + return CONSTRUCTOR_ELT (vect, index)->value; >return build_zero_cst 
(vect_elt_type
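For reference, the situation described above - a CONSTRUCTOR whose elements are themselves vectors - arises for code along these lines; this is a hand-written sketch in the spirit of the new pr54713 tests, not a copy of them:

  /* A 32-byte generic vector type on a target whose hardware vectors are
     only 16 bytes wide: tree-vect-generic.c lowers the operation piecewise
     and reassembles the result with a CONSTRUCTOR of two vector-typed
     elements.  */
  typedef int V __attribute__ ((vector_size (32)));

  V
  foo (V x, V y)
  {
    return x + y;
  }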
Re: RFC: LRA for x86/x86-64 [0/9]
On Sat, Sep 29, 2012 at 10:26 PM, Steven Bosscher wrote: > Hi Vlad, > > Thanks for the testing and the logs. You must have good hardware, your > timings are all ~3 times faster than mine :-) > > On Sat, Sep 29, 2012 at 3:01 AM, Vladimir Makarov wrote: >> --32-bit >> Reload: >> 581.85user 29.91system 27:15.18elapsed 37%CPU (0avgtext+0avgdata >> LRA: >> 629.67user 24.16system 24:31.08elapsed 44%CPU (0avgtext+0avgdata > > This is a ~8% slowdown. > > >> --64-bit:--- >> Reload: >> 503.26user 36.54system 30:16.62elapsed 29%CPU (0avgtext+0avgdata >> LRA: >> 598.70user 30.90system 27:26.92elapsed 38%CPU (0avgtext+0avgdata > > This is a ~19% slowdown I think both measurements run into swap (low CPU utilization), from the LRA numbers I'd say that LRA uses less memory but the timings are somewhat useless with the swapping. >> Here is the numbers for PR54146 on the same machine with -O1 only for >> 64-bit (compiler reports error for -m32). > > Right, the test case is for 64-bits only, I think it's preprocessed > code for AMD64. > >> Reload: >> 350.40user 21.59system 17:09.75elapsed 36%CPU (0avgtext+0avgdata >> LRA: >> 468.29user 21.35system 15:47.76elapsed 51%CPU (0avgtext+0avgdata > > This is a ~34% slowdown. > > To put it in another perspective, here are my timings of trunk vs lra > (both checkouts done today): > > trunk: > integrated RA : 181.68 (24%) usr 1.68 (11%) sys 183.43 > (24%) wall 643564 kB (20%) ggc > reload : 11.00 ( 1%) usr 0.18 ( 1%) sys 11.17 ( > 1%) wall 32394 kB ( 1%) ggc > TOTAL : 741.6414.76 756.41 > 3216164 kB > > lra branch: > integrated RA : 174.65 (16%) usr 1.33 ( 8%) sys 176.33 > (16%) wall 643560 kB (20%) ggc > reload : 399.69 (36%) usr 2.48 (15%) sys 402.69 > (36%) wall 41852 kB ( 1%) ggc > TOTAL :1102.0616.05 1120.83 > 3231738 kB > > That's a 49% slowdown. The difference is completely accounted for by > the timing difference between reload and LRA. > (Timings done on gcc17, which is AMD Opteron(tm) Processor 8354 with > 15GB ram, so swapping is no issue.) > > It looks like the reload timevar is used for LRA. Why not have > multiple timevars, one per phase of LRA? Sth like the patch below > would be nice. This gives me the following timings: > > integrated RA : 189.34 (16%) usr 1.84 (11%) sys 191.18 > (16%) wall 643560 kB (20%) ggc > LRA non-specific: 59.82 ( 5%) usr 0.22 ( 1%) sys 60.12 ( > 5%) wall 18202 kB ( 1%) ggc > LRA virtuals eliminatenon: 56.79 ( 5%) usr 0.03 ( 0%) sys 56.80 ( > 5%) wall 19223 kB ( 1%) ggc > LRA reload inheritance : 6.41 ( 1%) usr 0.01 ( 0%) sys 6.42 ( > 1%) wall1665 kB ( 0%) ggc > LRA create live ranges : 175.30 (15%) usr 2.14 (13%) sys 177.44 > (15%) wall2761 kB ( 0%) ggc > LRA hard reg assignment : 130.85 (11%) usr 0.20 ( 1%) sys 131.17 > (11%) wall 0 kB ( 0%) ggc > LRA coalesce pseudo regs: 2.54 ( 0%) usr 0.00 ( 0%) sys 2.55 ( > 0%) wall 0 kB ( 0%) ggc > reload : 6.73 ( 1%) usr 0.20 ( 1%) sys 6.92 ( > 1%) wall 0 kB ( 0%) ggc > > so the LRA "slowness" (for lack of a better word) appears to be due to > scalability problems in all sub-passes. It would be nice to see if LRA just has a larger constant cost factor compared to reload or if it has bigger complexity. > The code size changes are impressive, but I think that this kind of > slowdown should be addressed before making LRA the default for any > target. Certainly if it shows bigger complexity, not sure for the constant factor (but for sure improvements are welcome). I suppose there is the option to revert back to reload by default for x86_64 as well for 4.8, right? 
That is, do both reload and LRA co-exist for each target or is it a definite decision target by target? Thanks, Richard. > Ciao! > Steven > > > > > Index: lra-assigns.c > === > --- lra-assigns.c (revision 191858) > +++ lra-assigns.c (working copy) > @@ -1261,6 +1261,8 @@ lra_assign (void) >bitmap_head insns_to_process; >bool no_spills_p; > > + timevar_push (TV_LRA_ASSIGN); > + >init_lives (); >sorted_pseudos = (int *) xmalloc (sizeof (int) * max_reg_num ()); >sorted_reload_pseudos = (int *) xmalloc (sizeof (int) * max_reg_num ()); > @@ -1312,5 +1314,6 @@ lra_assign (void) >free (sorted_pseudos); >free (sorted_reload_pseudos); >finish_lives (); > + timevar_pop (TV_LRA_ASSIGN); >return no_spills_p; > } > Index: lra.c > === > --- lra.c (revision 191858) > +++ lra.c (working copy) > @@ -2193,6 +2193,7 @@ lra (FILE *f) > >lra_dump_file = f; > > + timevar_p
Re: RFC: LRA for x86/x86-64 [0/9]
On Sun, Sep 30, 2012 at 6:52 PM, Steven Bosscher wrote:
> Hi,
>
> To look at it in yet another way:
>
>> integrated RA : 189.34 (16%) usr
>> LRA non-specific: 59.82 ( 5%) usr
>> LRA virtuals eliminatenon: 56.79 ( 5%) usr
>> LRA create live ranges : 175.30 (15%) usr
>> LRA hard reg assignment : 130.85 (11%) usr
>
> The IRA pass is slower than the next-slowest pass (tree PRA) by almost
> a factor 2.5.  Each of the individually-measured *phases* of LRA is
> slower than the complete IRA *pass*.  These 5 timevars together make up
> for 52% of all compile time.

That figure indeed makes IRA + LRA look bad.  Did you by chance identify
anything obvious that can be done to improve the situation?

Thanks,
Richard.

> IRA already has scalability problems, let's not add more of that with LRA.
>
> Ciao!
> Steven
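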
Re: vec_cond_expr adjustments
On Fri, Sep 28, 2012 at 6:55 PM, Marc Glisse wrote: > On Fri, 28 Sep 2012, Richard Guenther wrote: > >> On Fri, Sep 28, 2012 at 12:42 AM, Marc Glisse >> wrote: >>> >>> Hello, >>> >>> I have been experimenting with generating VEC_COND_EXPR from the >>> front-end, >>> and these are just a couple things I noticed. >>> >>> 1) optabs.c requires that the first argument of vec_cond_expr be a >>> comparison, but verify_gimple_assign_ternary only checks >>> is_gimple_condexpr, >>> like for COND_EXPR. In the long term, it seems better to also allow >>> ssa_name >>> and vector_cst (thus match the is_gimple_condexpr condition), but for now >>> I >>> just want to know early if I created an invalid vec_cond_expr. >> >> >> optabs should be fixed instead, an is_gimple_val condition is implicitely >> val != 0. > > > For vectors, I think it should be val < 0 (with an appropriate cast of val > to a signed integer vector type if necessary). Or (val & highbit) != 0, but > that's longer. I don't think so. Throughout the compiler we generally assume false == 0 and anything else is true. (yes, for FP there is STORE_FLAG_VALUE, but it's scope is quite limited - if we want sth similar for vectors we'd have to invent it). >> The tree.[ch] and gimple-fold.c hunks are ok if tested properly, the >> tree-ssa-forwprop.c idea of using TREE_TYPE (cond), too. > > > Ok, I will retest that way. > > >> I don't like the tree-cfg.c change, instead re-factor optabs.c to >> get a decomposed cond for vector_compare_rtx and appropriately >> "decompose" a non-comparison-class cond in expand_vec_cond_expr. > > > So vector_compare_rtx will take as arguments rcode, t_op0, t_op1 instead of > cond. Yes. > And in expand_vec_cond_expr, if I have a condition, I pass its > elements to vector_compare_rtx, and otherwise I use 0 and the code for > LT_EXPR as the other arguments. Yes, but NE_EXPR and 0 (see above). > >> If we for example have >> >> predicate = a < b; >> x = predicate ? d : e; >> y = predicate ? f : g; >> >> we ideally want to re-use the predicate computation on targets where >> that would be optimal (and combine should be able to recover the >> case where it is not). > > > That I don't understand. The vcond instruction implemented by targets takes > as arguments d, e, cmp, a, b and emits the comparison itself. I don't see > how I can avoid sending to the targets both (d,e,<,a,b) and (f,g,<,a,b). > They will notice eventually that a two, but I don't see how to do that in optabs.c. Or I can compute x = a < b, > use x < 0 as the comparison passed to the targets, and expect targets (those > for which it is true) to recognize that < 0 is useless in a vector condition > (PR54700), or is useless on a comparison result. But that's a limitation of how vcond works. ISTR there is/was a vselect instruction as well, taking a "mask" and two vectors to select from. At least that's how vcond works internally for some sub-targets. Richard. > Thanks for the comments, > > -- > Marc Glisse
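To make the mask encoding concrete: with the all-ones / all-zeros elements that vector comparisons produce (the "high bit matters" OpenCL convention), the select itself needs nothing more than bitwise operations, which is essentially what a mask-consuming vselect-style pattern would get to see.  A standalone sketch using the generic vector extensions, not GCC-internal code:

  typedef int v4si __attribute__ ((vector_size (16)));

  v4si
  select (v4si a, v4si b, v4si d, v4si e)
  {
    v4si mask = a < b;               /* each element: -1 if true, 0 if false */
    return (d & mask) | (e & ~mask); /* x = predicate ? d : e, as bit ops */
  }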
Re: [patch] Minor TARGET_MEM_REF cleanup
On Sat, Sep 29, 2012 at 1:17 PM, Eric Botcazou wrote: > Hi, > > for simple loops like: > > extern int a[]; > extern int b[]; > > void foo (int l) > { > int i; > > for (i = 0; i < l; i++) > a[i] = b [i]; > } > > you get in the .lim3 dump: > > Unanalyzed memory reference 0: _5 = MEM[symbol: b, index: ivtmp.3_1, step: 4, > offset: 0B]; > Memory reference 1: MEM[symbol: a, index: ivtmp.3_1, step: 4, offset: 0B] > > so the pass analyzes the store but not the load, which seems an oversight. > The patch also folds copy_mem_ref_info into its only user and removes it. > > Tested on x86_64-suse-linux, OK for mainline? Please take the opportunity to clean up simple_mem_ref_in_stmt some more. Both loads and stores in assigns require gimple_assign_single_p, thus do if (!gimple_assign_single_p (stmt)) return NULL; before deciding on store/load. To decide that the stmt is a load then do if (TREE_CODE (*lhs) == SSA_NAME && gimple_vuse (stmt)) it is a store if gimple_vdef (stmt) && (TREE_CODE (*rhs) == SSA_NAME || is_gimple_min_invariant (*rhs)) else it may still be an aggregate copy but LIM doesn't handle those (though they may still be interesting for disambiguation ...) The tree-ssa-address parts are ok as-is. Thanks, Richard. > > 2012-09-29 Eric Botcazou > > * tree.h (copy_mem_ref_info): Delete. > * tree-ssa-address.c (copy_mem_ref_info): Likewise. > (maybe_fold_tmr): Copy flags manually. > * tree-ssa-loop-im.c (simple_mem_ref_in_stmt): Accept TARGET_MEM_REF > on the RHS as well. > > > -- > Eric Botcazou
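Putting the suggested checks together, the cleaned-up simple_mem_ref_in_stmt would look roughly like this (a sketch of the review comment above, not the committed patch):

  static tree *
  simple_mem_ref_in_stmt (gimple stmt, bool *is_store)
  {
    tree *lhs, *rhs;

    /* Both the load and the store case require a single-rhs assignment.  */
    if (!gimple_assign_single_p (stmt))
      return NULL;

    lhs = gimple_assign_lhs_ptr (stmt);
    rhs = gimple_assign_rhs1_ptr (stmt);

    if (TREE_CODE (*lhs) == SSA_NAME && gimple_vuse (stmt))
      {
        *is_store = false;
        return rhs;
      }
    else if (gimple_vdef (stmt)
             && (TREE_CODE (*rhs) == SSA_NAME
                 || is_gimple_min_invariant (*rhs)))
      {
        *is_store = true;
        return lhs;
      }
    /* Anything else - e.g. an aggregate copy - is not handled here.  */
    return NULL;
  }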
Re: [PATCH] Fix PR middle-end/54759
On Sun, Sep 30, 2012 at 9:03 PM, Dehao Chen wrote:
> Hi,
>
> This patch fixes the bug when comparing location to UNKNOWN_LOC.
>
> Bootstrapped and passed gcc regression test.
>
> Okay for trunk?

Ok.

Thanks,
Richard.

> Thanks,
> Dehao
>
> 2012-09-30  Dehao Chen
>
> PR middle-end/54759
> * gcc/tree-vect-loop-manip.c (slpeel_make_loop_iterate_ntimes): Use
> LOCATION_LOCUS to compare with UNKNOWN_LOCATION.
> (slpeel_tree_peel_loop_to_edge): Likewise.
> * gcc/tree-vectorizer.c (vectorize_loops): Likewise.
Re: RFC: LRA for x86/x86-64 [0/9]
On Mon, Oct 1, 2012 at 7:48 AM, Jakub Jelinek wrote: > On Sun, Sep 30, 2012 at 06:50:50PM -0400, Vladimir Makarov wrote: >> But I think that LRA cpu time problem for this test can be fixed. >> But I don't think I can fix it for 2 weeks. So if people believe >> that current LRA behaviour on this PR is a stopper to include it >> into gcc4.8 than we should postpone its inclusion until gcc4.9 when >> I hope to fix it. > > I think this testcase shouldn't be a show stopper for LRA inclusion into > 4.8, but something to look at for stage3. I agree here. > I think a lot of GCC passes have scalability issues on that testcase, > that is why it must be compiled with -O1 and not higher optimization > options, so perhaps it would be enough to choose a faster algorithm > generating worse code for the huge functions and -O1. Yes, we spent quite some time in making basic optimization work for insane testcases (basically avoid quadratic or bigger complexity in any IL size variable (number of basic-blocks, edges, instructions, pseudos, etc.)). And indeed if you use -O2 we do have issues with existing passes (and even at -O1 points-to analysis can wreck things, or even profile guessing - see existing bugs for that). Basically I would tune -O1 towards being able to compile and optimize insane testcases with memory and compile-time requirements that are linear in any of the above complexity measures. Thus, falling back to the -O0 register allocating strathegy at certain thresholds for the above complexity measures is fine (existing IRA for example has really bad scaling on the number of loops in the function, but you can tweak with flags to make it not consider that). > And I agree it is primarily a bug in the generator that it creates such huge > functions, that can't perform very well. Well, not for -O2+, yes, but at least we should try(!) hard. Thanks, Richard. > Jakub
Re: Profile housekeeping 5/n (make RTL loop optimizers to use loop bounds better)
On Mon, Oct 1, 2012 at 11:37 AM, Jan Hubicka wrote: > Hi, > this patch commonizes the maximal iteration estimate logic in between SCEV and > loop-iv. Both are now using loop->nb_iterations_upper_bound. I decided to > keep same API for SCEV code as for RTL code, so I made > estimated_loop_iterations and max_loop_iterations to not try to recompute > bounds and ICE when invoked without SCEV fired on. > > The patch updates RTL optimizers to use the estimated_loop_iterations and > max_loop_iterations. This has few advantages: > 1) loop unroller can now take into account estimates stored into > loop structure by earlier pass (I think none exist though) > It is however better then using expected_loop_iterations since > profile might get out of date with expansion. > > 2) loop peeling code now use max iterations bounds. This makes it i.e. > to peel vectorizer prologues/epilogues/scalar loops so -fpeel-loops > now improves my low iteration count testcase by about 10% > > 3) Same for loop unswithcing. > > I am not really friend with the new double_int API. I copied some existing > examples but find it ugly. Why do we miss operators for comparsions and > division? Why from_*/to_* can't be a cast at least for basic integer types? > > Regtested/bootstrapped x86_64-linux, seems sane? Yes. Can you add a testcase or two? I tweaked RTL unroll/peel after preserving loops to not blindly unroll/peel everything 8 times (not sure if _I_ added testcases ...). > I also wonder if loop vectorizer should not update the estimates after > loop iteration count is reduced by vectorizing. Probably yes. Thanks, Richard. > Honza > * loop-unswitch.c (unswitch_single_loop): Use > estimated_loop_iterations_int to prevent unswitching when loop > is known to not roll. > * tree-ssa-loop-niter.c (estimated_loop_iterations): Do not segfault > when SCEV is not initialized. > (max_loop_iterations): Likewise. > * tree-ssa-loop-unswitch.c (tree_ssa_unswitch_loops): Use > estimated_loop_iterations_int to prevent unswithcing when > loop is known to not roll. > * tree-scalar-evolution.c (scev_initialized_p): New function. > * tree-scalar-evolution.h (scev_initialized_p): Likewise. > * loop-unroll.c (decide_peel_once_rolling): Use > max_loop_iterations_int. > (unroll_loop_constant_iterations): Update > nb_iterations_upper_bound and nb_iterations_estimate. > (decide_unroll_runtime_iterations): Use > estimated_loop_iterations or max_loop_iterations; > (unroll_loop_runtime_iterations): fix profile updating. > (decide_peel_simple): Use estimated_loop_iterations > and max_loop_iterations. > (decide_unroll_stupid): Use estimated_loop_iterations > ad max_loop_iterations. > * loop-doloop.c (doloop_modify): Use max_loop_iterations_int. > (doloop_optimize): Likewise. > * loop-iv.c (iv_number_of_iterations): Use record_niter_bound. > (find_simple_exit): Likewise. > * cfgloop.h (struct niter_desc): Remove niter_max. > > Index: loop-unswitch.c > === > *** loop-unswitch.c (revision 191867) > --- loop-unswitch.c (working copy) > *** unswitch_single_loop (struct loop *loop, > *** 257,262 > --- 257,263 > rtx cond, rcond = NULL_RTX, conds, rconds, acond, cinsn; > int repeat; > edge e; > + HOST_WIDE_INT iterations; > > /* Do not unswitch too much. */ > if (num > PARAM_VALUE (PARAM_MAX_UNSWITCH_LEVEL)) > *** unswitch_single_loop (struct loop *loop, > *** 299,305 > } > > /* Nor if the loop usually does not roll. */ > ! 
if (expected_loop_iterations (loop) < 1) > { > if (dump_file) > fprintf (dump_file, ";; Not unswitching, loop iterations < 1\n"); > --- 300,307 > } > > /* Nor if the loop usually does not roll. */ > ! iterations = estimated_loop_iterations_int (loop); > ! if (iterations >= 0 && iterations <= 1) > { > if (dump_file) > fprintf (dump_file, ";; Not unswitching, loop iterations < 1\n"); > Index: tree-ssa-loop-niter.c > === > *** tree-ssa-loop-niter.c (revision 191867) > --- tree-ssa-loop-niter.c (working copy) > *** estimate_numbers_of_iterations_loop (str > *** 3012,3020 > bool > estimated_loop_iterations (struct loop *loop, double_int *nit) > { > ! estimate_numbers_of_iterations_loop (loop); > if (!loop->any_estimate) > ! return false; > > *nit = loop->nb_iterations_estimate; > return true; > --- 3012,3034 > bool > estimated_loop_iterations (struct loop *loop, double_int *nit) > { > ! /* When SCEV information is available, try to update loop iterations > ! estimate. Othe
Re: Constant-fold vector comparisons
On Sat, Sep 29, 2012 at 3:25 PM, Marc Glisse wrote: > Hello, > > this patch does 2 things (I should have split it in 2, but the questions go > together): > > 1) it handles constant folding of vector comparisons, > > 2) it fixes another place where vectors are not expected (I'll probably wait > to have front-end support and testcases to do more of those, but there is > something to discuss). > > I wasn't sure what integer_truep should test exactly. For integer: == 1 or > != 0? For vectors: == -1 or < 0? I chose the one that worked best for the > forwprop case where I used it. > > It seems that before this patch, the middle-end didn't know how comparison > results were encoded (a good reason for VEC_COND_EXPR to require a > comparison as its first argument). I am using the OpenCL encoding that what > matters is the high bit of each vector element. I am not quite sure what > happens for targets (are there any?) that use a different encoding. When > expanding vcond, they can do the comparison as they like. When expanding an > isolated comparison, I expect they have to expand it as vcond(a it should be ok, but I could easily have missed something. Comments below > > 2012-10-01 Marc Glisse > > gcc/ > * tree.c (integer_truep): New function. > * tree.h (integer_truep): Declare. > * tree-ssa-forwprop.c (forward_propagate_into_cond): Call it. > Don't use boolean_type_node for vectors. > * fold-const.c (fold_relational_const): Handle VECTOR_CST. > > gcc/testsuite/ > * gcc.dg/tree-ssa/foldconst-6.c: New testcase. > > -- > Marc Glisse > Index: gcc/tree.h > === > --- gcc/tree.h (revision 191850) > +++ gcc/tree.h (working copy) > @@ -5272,20 +5272,25 @@ extern int integer_zerop (const_tree); > > /* integer_onep (tree x) is nonzero if X is an integer constant of value 1. > */ > > extern int integer_onep (const_tree); > > /* integer_all_onesp (tree x) is nonzero if X is an integer constant > all of whose significant bits are 1. */ > > extern int integer_all_onesp (const_tree); > > +/* integer_truep (tree x) is nonzero if X is an integer constant of value > 1, > + or a vector constant of value < 0. */ > + > +extern bool integer_truep (const_tree); > + > /* integer_pow2p (tree x) is nonzero is X is an integer constant with > exactly one bit 1. */ > > extern int integer_pow2p (const_tree); > > /* integer_nonzerop (tree x) is nonzero if X is an integer constant > with a nonzero value. */ > > extern int integer_nonzerop (const_tree); > > Index: gcc/tree-ssa-forwprop.c > === > --- gcc/tree-ssa-forwprop.c (revision 191850) > +++ gcc/tree-ssa-forwprop.c (working copy) > @@ -564,46 +564,46 @@ forward_propagate_into_cond (gimple_stmt >enum tree_code code; >tree name = cond; >gimple def_stmt = get_prop_source_stmt (name, true, NULL); >if (!def_stmt || !can_propagate_from (def_stmt)) > return 0; > >code = gimple_assign_rhs_code (def_stmt); >if (TREE_CODE_CLASS (code) == tcc_comparison) > tmp = fold_build2_loc (gimple_location (def_stmt), >code, > - boolean_type_node, > + TREE_TYPE (cond), That's obvious. >gimple_assign_rhs1 (def_stmt), >gimple_assign_rhs2 (def_stmt)); >else if ((code == BIT_NOT_EXPR > && TYPE_PRECISION (TREE_TYPE (cond)) == 1) >|| (code == BIT_XOR_EXPR > - && integer_onep (gimple_assign_rhs2 (def_stmt > + && integer_truep (gimple_assign_rhs2 (def_stmt See below. 
> { > tmp = gimple_assign_rhs1 (def_stmt); > swap = true; > } > } > >if (tmp >&& is_gimple_condexpr (tmp)) > { >if (dump_file && tmp) > { > fprintf (dump_file, " Replaced '"); > print_generic_expr (dump_file, cond, 0); > fprintf (dump_file, "' with '"); > print_generic_expr (dump_file, tmp, 0); > fprintf (dump_file, "'\n"); > } > > - if (integer_onep (tmp)) > + if (integer_truep (tmp)) > gimple_assign_set_rhs_from_tree (gsi_p, gimple_assign_rhs2 (stmt)); >else if (integer_zerop (tmp)) > gimple_assign_set_rhs_from_tree (gsi_p, gimple_assign_rhs3 (stmt)); >else > { > gimple_assign_set_rhs1 (stmt, unshare_expr (tmp)); > if (swap) > { > tree t = gimple_assign_rhs2 (stmt); > gimple_assign_set_rhs2 (stmt, gimple_assign_rhs3 (stmt)); > Index: gcc/testsuite/gcc.dg/tree-ssa/foldconst-6.c > === > --- gcc/testsuite/gcc.dg/tree-ssa/foldconst-6.c (revision 0)
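The tree.c hunk implementing integer_truep is not visible in the quoted patch; with the semantics described in the tree.h comment above (value 1 for integer constants, every element with the sign bit set for vector constants), a sketch could look like the following.  This is illustrative only, not Marc's actual implementation:

  bool
  integer_truep (const_tree expr)
  {
    if (TREE_CODE (expr) == INTEGER_CST)
      return integer_onep (expr);
    if (TREE_CODE (expr) == VECTOR_CST)
      {
        unsigned int i;
        for (i = 0; i < VECTOR_CST_NELTS (expr); i++)
          if (!tree_int_cst_sign_bit (VECTOR_CST_ELT (expr, i)))
            return false;
        return true;
      }
    return false;
  }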
[PATCH] Fix PR47799 - debug info for early-inlining with LTO
This tries to emit proper debug information for early-inlined functions from LTO LTRANS phase (thus, emit DW_TAG_inlined_subroutine and allow gdb to set breakpoints). We need to avoid confusing LTO and dwarf2out with the full abstract block tree, so this patch "flattens" the abstract block tree by always using the ultimate origin for BLOCK_ABSTRACT_ORIGIN on blocks which are inlined_function_outer_scope_p. Thus, it tries to output the minimal info dwarf2out.c needs to emit the desired debug information. As with LTO all abstract inline instances get generated late for early inlined functions I had to amend the "hack" for extern inline functions to always output dies for decls that come their way through dwarf2out_abstract_function. And further down not crash on a NULL DECL_INITIAL (when LTO decided to output the function body in another LTRANS unit or if it does not get output at all). Currently LTO-bootstrapping and testing on x86_64-unknown-linux-gnu. Jason, are the dwarf2out bits ok with you? I've sofar toyed with examples like int x, y; static inline int foo (int i) { y = i; return y; } static inline int bar (int i) { x = i; return foo (x); } int main () { int k = 0; int res = bar (k); return res; } and debug information with/without LTO is now reasonably the same and I can set breakpoints on the inlined instances. Thanks, Richard. 2012-10-01 Richard Guenther PR lto/47788 * tree-streamer-out.c (write_ts_block_tree_pointers): For inlined functions outer scopes write the ultimate origin as BLOCK_ABSTRACT_ORIGIN and BLOCK_SOURCE_LOCATION. Do not stream the fragment chains. (lto_input_ts_block_tree_pointers): Likewise. * dwarf2out.c (gen_subprogram_die): Handle NULL DECL_INITIAL. (dwarf2out_decl): Always output DECL_ABSTRACT function decls. Index: gcc/tree-streamer-in.c === *** gcc/tree-streamer-in.c (revision 191824) --- gcc/tree-streamer-in.c (working copy) *** static void *** 789,810 lto_input_ts_block_tree_pointers (struct lto_input_block *ib, struct data_in *data_in, tree expr) { - /* Do not stream BLOCK_SOURCE_LOCATION. We cannot handle debug information - for early inlining so drop it on the floor instead of ICEing in - dwarf2out.c. */ BLOCK_VARS (expr) = streamer_read_chain (ib, data_in); - /* Do not stream BLOCK_NONLOCALIZED_VARS. We cannot handle debug information - for early inlining so drop it on the floor instead of ICEing in - dwarf2out.c. */ - BLOCK_SUPERCONTEXT (expr) = stream_read_tree (ib, data_in); ! /* Do not stream BLOCK_ABSTRACT_ORIGIN. We cannot handle debug information ! for early inlining so drop it on the floor instead of ICEing in dwarf2out.c. */ ! BLOCK_FRAGMENT_ORIGIN (expr) = stream_read_tree (ib, data_in); ! BLOCK_FRAGMENT_CHAIN (expr) = stream_read_tree (ib, data_in); /* We re-compute BLOCK_SUBBLOCKS of our parent here instead of streaming it. For non-BLOCK BLOCK_SUPERCONTEXTs we still --- 789,810 lto_input_ts_block_tree_pointers (struct lto_input_block *ib, struct data_in *data_in, tree expr) { BLOCK_VARS (expr) = streamer_read_chain (ib, data_in); BLOCK_SUPERCONTEXT (expr) = stream_read_tree (ib, data_in); ! /* Stream BLOCK_ABSTRACT_ORIGIN and BLOCK_SOURCE_LOCATION for ! the limited cases we can handle - those that represent inlined ! function scopes. For the rest them on the floor instead of ICEing in dwarf2out.c. */ ! BLOCK_ABSTRACT_ORIGIN (expr) = stream_read_tree (ib, data_in); ! BLOCK_SOURCE_LOCATION (expr) = lto_input_location (ib, data_in); ! /* Do not stream BLOCK_NONLOCALIZED_VARS. We cannot handle debug information ! 
for early inlined BLOCKs so drop it on the floor instead of ICEing in ! dwarf2out.c. */ ! ! /* BLOCK_FRAGMENT_ORIGIN and BLOCK_FRAGMENT_CHAIN is not live at LTO ! streaming time. */ /* We re-compute BLOCK_SUBBLOCKS of our parent here instead of streaming it. For non-BLOCK BLOCK_SUPERCONTEXTs we still Index: gcc/tree-streamer-out.c === *** gcc/tree-streamer-out.c (revision 191824) --- gcc/tree-streamer-out.c (working copy) *** write_ts_exp_tree_pointers (struct outpu *** 682,702 static void write_ts_block_tree_pointers (struct output_block *ob, tree expr, bool ref_p) { - /* Do not stream BLOCK_SOURCE_LOCATION. We cannot handle debug information - for early inlining so drop it on the floor instead of ICEing in - dwarf2out.c. */ streamer_write_chain (ob, BLOCK_VARS (expr), ref_p); /* Do not stream BLOCK_NONLOCALIZED_VARS. We cannot handle debug information ! for early inlining so drop i
Re: [RFC] Make vectorizer to skip loops with small iteration estimate
On Sun, 30 Sep 2012, Jan Hubicka wrote: > Hi, > the point of the following patch is to make vectorizer to not vectorize the > following testcase with profile feedback: > > int a[1]; > int i=5; > int k=2; > int val; > __attribute__ ((noinline,noclone)) > test() > { > int j; > for(j=0;j a[j]=val; > } > main() > { > while (i) > { > test (); > i--; > } > } > > Here the compiler should work out that the second loop iterates 2 times at the > average and thus it is not good candidate for vectorizing. > > In my first attempt I added the following: > @@ -1474,6 +1478,18 @@ vect_analyze_loop_operations (loop_vec_i >return false; > } > > + if ((estimated_niter = estimated_stmt_executions_int (loop)) != -1 > + && (unsigned HOST_WIDE_INT) estimated_niter <= th) > +{ > + if (vect_print_dump_info (REPORT_UNVECTORIZED_LOCATIONS)) > +fprintf (vect_dump, "not vectorized: estimated iteration count too > small."); > + if (vect_print_dump_info (REPORT_DETAILS)) > +fprintf (vect_dump, "not vectorized: estimated iteration count > smaller than " > + "user specified loop bound parameter or minimum " > + "profitable iterations (whichever is more conservative)."); > + return false; > +} > + > > But to my surprise it does not help. There are two things: > > 1) the value of TH is bit low. In a way the cost model works is that >it finds minimal niters where vectorized loop with all the setup costs >is cheaper than the vector loop with all the setup costs. I.e. > > /* Calculate number of iterations required to make the vector version > profitable, relative to the loop bodies only. The following condition > must hold true: > SIC * niters + SOC > VIC * ((niters-PL_ITERS-EP_ITERS)/VF) + VOC(A) > where > SIC = scalar iteration cost, VIC = vector iteration cost, > VOC = vector outside cost, VF = vectorization factor, > PL_ITERS = prologue iterations, EP_ITERS= epilogue iterations > SOC = scalar outside cost for run time cost model check. */ > > This value is used for both > 1) decision if number of iterations is too low (max iterations is known) > 2) decision on runtime whether we want to take the vectorized path > or the scalar path. > > The vectoried loop looks like: > k.1_10 = k; > if (k.1_10 > 0) > { > pretmp_2 = val; > niters.8_4 = (unsigned int) k.1_10; > bnd.9_13 = niters.8_4 >> 2; > ratio_mult_vf.10_1 = bnd.9_13 << 2; > _18 = niters.8_4 <= 3; > _19 = ratio_mult_vf.10_1 == 0; > _20 = _19 | _18; > if (_20 != 0) > scalar loop > else > vector prologue > } > > So the unvectorized cost is > SIC * niters > > The vectorized path is > SOC + VIC * ((niters-PL_ITERS-EP_ITERS)/VF) + VOC > The scalar path of vectorizer loop is > SIC * niters + SOC Note that 'th' is used for the runtime profitability check which is done at the time the setup cost has already been taken (yes, we probably should make it more conservative but then guard the whole set of loops by the check, not only the vectorized path). See PR53355 for the general issue. >It makes sense to vectorize if >SIC * niters > SOC + VIC * ((niters-PL_ITERS-EP_ITERS)/VF) + VOC (B) >That is in the optimal cse where we actually vectorize the overall >speed of vectorized loop including the runtime check is better. > >It makes sense to take the vector loop if >SIC * niters > VIC * ((niters-PL_ITERS-EP_ITERS)/VF) + VOC (C) >Because the scalar loop is taken. > >The attached patch implements the formula (C) and uses it to deterine the >decision based on number of iterations estimate (that is usually provided > by >the feedback) > >As a reality check, I tried my testcase. 
> >9: Cost model analysis: > Vector inside of loop cost: 1 > Vector prologue cost: 7 > Vector epilogue cost: 2 > Scalar iteration cost: 1 > Scalar outside cost: 6 > Vector outside cost: 9 > prologue iterations: 0 > epilogue iterations: 2 > Calculated minimum iters for profitability: 4 > >9: Profitability threshold = 3 > >9: Profitability estimated iterations threshold = 20 > >This is overrated. The loop starts to be benefical at about 4 iterations in >reality. I guess the values are kind of wrong. > >Vector inside of loop cost and Scalar iteration cost seems to ignore the >fact that the loops do contain some control flow that should account at > least >one extra cycle. > >Vector prologue cost seems bit overrated for one pack operation. > >Of course this is very simple benchmark, in reality the vectorizatoin can > be >a lot more harmful by complicating more complex control flows. > >So I guess we have two
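As a sanity check, the formulas can be run against the quoted dump by hand; this is just arithmetic on the numbers above (SIC = 1, SOC = 6, VIC = 1, VOC = 9, VF = 4, PL_ITERS = 0, EP_ITERS = 2):

  formula (A):  niters * 1 + 6  >  1 * ((niters - 0 - 2) / 4) + 9
                holds from niters = 4 on, matching
                "Calculated minimum iters for profitability: 4"
                and the runtime threshold of 3.

Running the same numbers through (B) and (C) gives crossover points of 20 and 12 iterations respectively, i.e. once the scalar side is no longer credited with SOC the vector path only pays off for noticeably longer-running loops.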
Re: [patch] experimenting with renumbering of pseudos after expand
On Mon, Oct 1, 2012 at 2:06 PM, Steven Bosscher wrote: > Hello, > > For most code, expand creates a lot of pseudos that are cleaned up in > subsequent passes, if they even live long enough to make it there. On > average, for cc1 preprocessed source, the number of "holes" in > regno_reg_rtx is about half the size of that array, or in other words: > regno_reg_rtx is almost a sparse array. > > I've been experimenting with the attached patch to renumber pseudos > just before initializing DF. For the already-notorious test case of > PR54146 on x86_64, the patch reduces max_reg_num from 348404 to 180502 > for the largest function. This reduces the memory foot print of the > test case to less than 6GB, an almost 25% reduction, and it speeds up > the DF_LR and DF_LIVE problems a bit (~15% but that's not much on the > total compile time). > > I had hopes that it would help for LRA compile times also, but > unfortunately the only significant change is this one: > > without patch: > LRA hard reg assignment : 130.85 (11%) usr 0.20 ( 1%) sys 131.17 > (11%) wall 0 kB ( 0%) ggc > > with patch: > LRA hard reg assignment : 108.92 ( 9%) usr 0.07 ( 0%) sys 109.03 ( > 9%) wall 0 kB ( 0%) ggc > > Anyway, putting the patch out there to show that I've done more than > just complaining ;-) ;) Certainly an interesting idea. I wonder how much pseudos we allocate later during optimizations - thus, would sth like the SSA name freelist (and a way to formally release a pseudo, similar to release_ssa_name) improve the situation? I guess most of the wasted pseudos are created because we are not very good at generating initial RTL - are there maybe a few patterns we can do better on that would catch a big fraction of the waste? Thanks, Richard. > It also would suggest that the scalability challenge for LRA is not > the number of pseudos but something else (number of insns, I guess). > > Ciao! > Steven
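A rough sketch of what the free-list idea mentioned above could look like; everything here is hypothetical (no such functions exist, the names are made up) and only illustrates the analogy with release_ssa_name:

  /* Hypothetical: pseudo numbers released by cleanup passes, to be
     recycled instead of growing regno_reg_rtx further.  */
  static bitmap released_pseudos;

  static void
  release_pseudo (rtx reg)
  {
    gcc_assert (REG_P (reg) && REGNO (reg) >= FIRST_PSEUDO_REGISTER);
    bitmap_set_bit (released_pseudos, REGNO (reg));
  }

  /* Hypothetical wrapper around gen_reg_rtx preferring recycled numbers.  */
  static rtx
  gen_reg_rtx_reuse (enum machine_mode mode)
  {
    if (released_pseudos && !bitmap_empty_p (released_pseudos))
      {
        unsigned int regno = bitmap_first_set_bit (released_pseudos);
        bitmap_clear_bit (released_pseudos, regno);
        regno_reg_rtx[regno] = gen_raw_REG (mode, regno);
        return regno_reg_rtx[regno];
      }
    return gen_reg_rtx (mode);
  }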
[PATCH] Fix -frounding-math builtins
I noticed that we attach the no-vops attribute to -frounding-math math functions. That's bogus as can be seen from the testcase int fesetround(int); double asinh(double x); double foo (double x, int b) { double y = 0.0, z; if (b) y = asinh (x); fesetround (0x400 /*FE_DOWNWARD*/); z = asinh (x); return y + z; } where PRE rightfully so removes a seeming partial redundancy by inserting a asinh call into the else block. That's because it exactly does _not_ get to see the rounding mode clobbering fesetround call as asinh does not have a virtual operand. Fixed as follows. Bootstrapped and tested on x86_64-unknown-linux-gnu. Richard. 2012-10-01 Richard Guenther * builtins.def (ATTR_MATHFN_FPROUNDING): Do not use no-vops with -frounding-math. * builtin-attrs.def (ATTR_PURE_NOTHROW_NOVOPS_LIST): Remove. (ATTR_PURE_NOTHROW_NOVOPS_LEAF_LIST): Likewise. Index: gcc/builtins.def === *** gcc/builtins.def(revision 191917) --- gcc/builtins.def(working copy) *** along with GCC; see the file COPYING3. *** 163,169 memory. */ #undef ATTR_MATHFN_FPROUNDING #define ATTR_MATHFN_FPROUNDING (flag_rounding_math ? \ ! ATTR_PURE_NOTHROW_NOVOPS_LEAF_LIST : ATTR_CONST_NOTHROW_LEAF_LIST) /* Define an attribute list for math functions that are normally "impure" because some of them may write into global memory for --- 163,169 memory. */ #undef ATTR_MATHFN_FPROUNDING #define ATTR_MATHFN_FPROUNDING (flag_rounding_math ? \ ! ATTR_PURE_NOTHROW_LEAF_LIST : ATTR_CONST_NOTHROW_LEAF_LIST) /* Define an attribute list for math functions that are normally "impure" because some of them may write into global memory for Index: gcc/builtin-attrs.def === *** gcc/builtin-attrs.def (revision 191917) --- gcc/builtin-attrs.def (working copy) *** DEF_ATTR_TREE_LIST (ATTR_PURE_NOTHROW_LI *** 127,136 ATTR_NULL, ATTR_NOTHROW_LIST) DEF_ATTR_TREE_LIST (ATTR_PURE_NOTHROW_LEAF_LIST, ATTR_PURE, \ ATTR_NULL, ATTR_NOTHROW_LEAF_LIST) - DEF_ATTR_TREE_LIST (ATTR_PURE_NOTHROW_NOVOPS_LIST, ATTR_NOVOPS, \ - ATTR_NULL, ATTR_PURE_NOTHROW_LIST) - DEF_ATTR_TREE_LIST (ATTR_PURE_NOTHROW_NOVOPS_LEAF_LIST, ATTR_NOVOPS,\ - ATTR_NULL, ATTR_PURE_NOTHROW_LEAF_LIST) DEF_ATTR_TREE_LIST (ATTR_NORETURN_NOTHROW_LIST, ATTR_NORETURN,\ ATTR_NULL, ATTR_NOTHROW_LIST) DEF_ATTR_TREE_LIST (ATTR_NORETURN_NOTHROW_LEAF_LIST, ATTR_NORETURN,\ --- 127,132
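Sketching by hand what PRE does to the testcase while asinh is marked no-vops (an illustration, not an actual dump; the temporary name is made up):

  if (b)
    y = asinh (x);
  else
    tem = asinh (x);   /* inserted by PRE - still the old rounding mode */
  fesetround (0x400);
  z = b ? y : tem;     /* the original second call is gone */
  return y + z;

Without a VUSE on the calls there is nothing connecting asinh to the memory side effect of fesetround, so the call after the rounding-mode change looks fully redundant with the value computed before it.  Dropping no-vops while keeping pure restores the virtual operand and blocks the transform.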
Re: [PATCH] Add option for dumping to stderr (issue6190057)
On Mon, Oct 1, 2012 at 3:55 PM, Sharad Singhai wrote: > On Mon, Oct 1, 2012 at 6:52 AM, H.J. Lu wrote: >> On Mon, Oct 1, 2012 at 6:49 AM, Sharad Singhai wrote: >>> I am sorry, I didn't enable all the languages. Will fix the fortran >>> test breakage shortly. >> >> It is not just Fortran. There are some failures in C testcases. > > I checked and those files looked like generator files for Fortran > tests and thus were not exercised in my configuration. I am really > sorry about that. I am fixing it. As I said, you should not enable/disable anything special but configure with all default languages enabled (no --enable-languages) and do toplevel make -k check, preferably also excercising multilibs with RUNTESTFLAGS="--target_board=unix/\{,-m32\}" Richard. > UNSUPPORTED: gcc.dg/tree-ssa/gen-vect-11.c > UNSUPPORTED: gcc.dg/tree-ssa/gen-vect-11a.c > UNSUPPORTED: gcc.dg/tree-ssa/gen-vect-11b.c > UNSUPPORTED: gcc.dg/tree-ssa/gen-vect-11c.c > UNSUPPORTED: gcc.dg/tree-ssa/gen-vect-2.c > UNSUPPORTED: gcc.dg/tree-ssa/gen-vect-25.c > UNSUPPORTED: gcc.dg/tree-ssa/gen-vect-26.c > UNSUPPORTED: gcc.dg/tree-ssa/gen-vect-28.c > UNSUPPORTED: gcc.dg/tree-ssa/gen-vect-32.c > > Thanks, > Sharad > >> >>> Thanks, >>> Sharad >>> Sharad >>> >>> >>> On Mon, Oct 1, 2012 at 4:50 AM, H.J. Lu wrote: On Sun, Sep 30, 2012 at 11:36 PM, Sharad Singhai wrote: > Resend to gcc-patches > > I have addressed the comments by fixing all the minor issues, > bootstrapped and tested on x86_64. I did the recommended reshuffling > by moving non-tree code from tree-dump.c into a new file dumpfile.c. > > I committed two successive revisions > r191883 Main patch with the dump infrastructure changes. However, I > accidentally left out a new file, dumpfile.c. > r191884 Added dumpfile.c, and did the renaming of dump_* functions > from gimple_pretty_print.[ch]. > > As things stand right now, r191883 is broken because of the missing > file 'dumpfile.c', which the very next commit fixes. Anyone who got > broken revision r191883, please svn update. I am really very sorry > about that. > > I have a couple more minor patches which deal with renaming; I plan to > address those later. 
> It caused: FAIL: gcc.dg/tree-ssa/gen-vect-11.c (test for excess errors) FAIL: gcc.dg/tree-ssa/gen-vect-11.c scan-tree-dump-times vect "vectorized 1 loops" 1 FAIL: gcc.dg/tree-ssa/gen-vect-11a.c (test for excess errors) FAIL: gcc.dg/tree-ssa/gen-vect-11a.c scan-tree-dump-times vect "vectorized 1 loops" 1 FAIL: gcc.dg/tree-ssa/gen-vect-11b.c (test for excess errors) FAIL: gcc.dg/tree-ssa/gen-vect-11b.c scan-tree-dump-times vect "vectorized 0 loops" 1 FAIL: gcc.dg/tree-ssa/gen-vect-11c.c (test for excess errors) FAIL: gcc.dg/tree-ssa/gen-vect-11c.c scan-tree-dump-times vect "vectorized 0 loops" 1 FAIL: gcc.dg/tree-ssa/gen-vect-2.c (test for excess errors) FAIL: gcc.dg/tree-ssa/gen-vect-2.c scan-tree-dump-times vect "vectorized 1 loops" 1 FAIL: gcc.dg/tree-ssa/gen-vect-25.c (test for excess errors) FAIL: gcc.dg/tree-ssa/gen-vect-25.c scan-tree-dump-times vect "vectorized 2 loops" 1 FAIL: gcc.dg/tree-ssa/gen-vect-26.c (test for excess errors) FAIL: gcc.dg/tree-ssa/gen-vect-26.c scan-tree-dump-times vect "Alignment of access forced using peeling" 1 FAIL: gcc.dg/tree-ssa/gen-vect-26.c scan-tree-dump-times vect "vectorized 1 loops" 1 FAIL: gcc.dg/tree-ssa/gen-vect-28.c (test for excess errors) FAIL: gcc.dg/tree-ssa/gen-vect-28.c scan-tree-dump-times vect "Alignment of access forced using peeling" 1 FAIL: gcc.dg/tree-ssa/gen-vect-28.c scan-tree-dump-times vect "vectorized 1 loops" 1 FAIL: gcc.dg/tree-ssa/gen-vect-32.c (test for excess errors) FAIL: gcc.dg/tree-ssa/gen-vect-32.c scan-tree-dump-times vect "vectorized 1 loops" 1 FAIL: gfortran.dg/vect/O3-pr36119.f90 (test for excess errors) FAIL: gfortran.dg/vect/O3-pr39595.f (test for excess errors) FAIL: gfortran.dg/vect/Ofast-pr50414.f90 (test for excess errors) FAIL: gfortran.dg/vect/cost-model-pr34445.f (test for excess errors) FAIL: gfortran.dg/vect/cost-model-pr34445a.f (test for excess errors) FAIL: gfortran.dg/vect/fast-math-pr38968.f90 (test for excess errors) FAIL: gfortran.dg/vect/fast-math-pr38968.f90 scan-tree-dump vect "vectorized 1 loops" FAIL: gfortran.dg/vect/fast-math-real8-pr40801.f90 (test for excess errors) FAIL: gfortran.dg/vect/fast-math-real8-pr40801.f90 (test for excess errors) FAIL: gfortran.dg/vect/fast-math-vect-8.f90 (test for excess errors) FAIL: gfortran.dg/vect/fast-math-vect-8.f90 scan-tree-dump-times vect "vectorized 1 loops" 1 FAIL: gfortran.dg/vect/no-fre-no-copy-prop-O3-pr51704.f90 (test for excess errors) FAIL: gfortran.dg/vect/no-vfa-pr32377.f90 (test for excess er
Re: [RFC] Make vectorizer to skip loops with small iteration estimate
On Mon, 1 Oct 2012, Jan Hubicka wrote: > > > > > > So the unvectorized cost is > > > SIC * niters > > > > > > The vectorized path is > > > SOC + VIC * ((niters-PL_ITERS-EP_ITERS)/VF) + VOC > > > The scalar path of vectorizer loop is > > > SIC * niters + SOC > > > > Note that 'th' is used for the runtime profitability check which is > > done at the time the setup cost has already been taken (yes, we > > Yes, I understand that. > > probably should make it more conservative but then guard the whole > > set of loops by the check, not only the vectorized path). > > See PR53355 for the general issue. > > Yep, we may reduce the cost of SOC by outputting early guard for > non-vectorized > path better than we do now. However... > > >Of course this is very simple benchmark, in reality the vectorizatoin > > > can be > > >a lot more harmful by complicating more complex control flows. > > > > > >So I guess we have two options > > > 1) go with the new formula and try to make cost model a bit more > > > realistic. > > > 2) stay with original formula that is quite close to reality, but I > > > think > > >more by an accident. > > > > I think we need to improve it as whole, thus I'd prefer 2). > > ... I do not see why. > Even if we make the check cheaper we will only distribute part of SOC to > vector > prologues/epilogues. > > Still I think the formula is wrong, I.e. accounting SOC where it should not. > > The cost of scalar path without vectorization is > niters * SIC > while with vectorization we have scalar path > niters * SIC + SOC > and vector path > SOC + VIC * ((niters-PL_ITERS-EP_ITERS)/VF) + VOC > > So SOC cancels out in the runtime check. > I still think we need two formulas - one determining if vectorization is > profitable, other specifying the threshold for scalar path at runtime (that > will generally give lower values). True, we want two values. But part of the scalar path right now is all the computation required for alias and alignment runtime checks (because the way all the conditions are combined). I'm not much into the details of what we account for in SOC (I suppose it's everything we insert in the preheader of the vector loop). + if (vect_print_dump_info (REPORT_UNVECTORIZED_LOCATIONS)) +fprintf (vect_dump, "not vectorized: estimated iteration count too small."); + if (vect_print_dump_info (REPORT_DETAILS)) +fprintf (vect_dump, "not vectorized: estimated iteration count smaller than " + "user specified loop bound parameter or minimum " + "profitable iterations (whichever is more conservative)."); this won't work anymore btw - dumping infrastructure changed. I suppose your patch is a step in the right direction, but to really make progress we need to re-organize the loop and predicate structure produced by the vectorizer. So, please update your patch, re-test and then it's ok. > > > 2) Even when loop iterates 2 times, it is estimated to 4 iterations by > > >estimated_stmt_executions_int with the profile feedback. > > >The reason is loop_ch pass. Given a rolled loop with exit probability > > >30%, proceeds by duplicating the header with original probabilities. > > >This makes the loop to be executed with 60% probability. Because the > > >loop body counts remain the same (and they should), the expected number > > >of iterations increase by the decrease of entry edge to the header. > > > > > >I wonder what to do about this. Obviously without path profiling > > >loop_ch can not really do a good job. 
We can artifically make > > >header to suceed more likely, that is the reality, but that requires > > >non-trivial loop profile updating. > > > > > >We can also simply record the iteration bound into loop structure > > >and ignore that the profile is not realistic > > > > But we don't preserve loop structure from header copying ... > > From what time we keep loop structure? In general I would like to eventualy > drop value histograms to loop structure specifying number of iterations with > profile feedback. We preserve it from the start of the tree loop optimizers (it's easy to preserve them from earlier points as long as you don't cross inlining, but to lower the impact of the change I placed it where it was enough to prevent the excessive unrolling/peeling done by RTL) > > > > >Finally we can duplicate loop headers before profilng. I implemented > > >that via early_ch pass executed only with profile generation or > > > feedback. > > >I guess it makes sense to do, even if it breaks the assumption that > > >we should do strictly -Os generation on paths where > > > > Well, there are CH cases that do not increase code size and I doubt > > that loop header copying is generally bad for -Os ... we are not > > good at handling non-copied loop headers. > > There is comment saying > /* Loop
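Spelling out the cancellation mentioned at the start of this message: the runtime check chooses between the two paths of the already-vectorized code,

  scalar path of the vectorized code:  niters * SIC + SOC
  vector path:  SOC + VIC * ((niters - PL_ITERS - EP_ITERS) / VF) + VOC

so the vector path should be taken when

  niters * SIC + SOC > SOC + VIC * ((niters - PL_ITERS - EP_ITERS) / VF) + VOC

and SOC appears on both sides, leaving exactly formula (C):

  niters * SIC > VIC * ((niters - PL_ITERS - EP_ITERS) / VF) + VOC

The decision whether to vectorize at all, by contrast, compares against the original scalar loop (niters * SIC, with no SOC to cancel), which is why the two questions want two different thresholds.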
Re: Convert more non-GTY htab_t to hash_table.
On Mon, 1 Oct 2012, Lawrence Crowl wrote: > Change more non-GTY hash tables to use the new type-safe template hash table. > Constify member function parameters that can be const. > Correct a couple of expressions in formerly uninstantiated templates. > > The new code is 0.362% faster in bootstrap, with a 99.5% confidence of > being faster. > > Tested on x86-64. > > Okay for trunk? You are changing a hashtable used by fold checking, did you test with fold checking enabled? +/* Data structures used to maintain mapping between basic blocks and + copies. */ +static hash_table bb_original; +static hash_table bb_copy; note that because hash_table has a constructor we now get global CTORs for all statics :( (and mx-protected local inits ...) Can you please try to remove the constructor from hash_table to avoid this overhead? (as a followup - that is, don't initialize htab) The cfg.c, dse.c and hash-table.h parts are ok for trunk, I'll leave the rest to respective maintainers of the pieces of the compiler. Thanks, Richard. > > Index: gcc/java/ChangeLog > > 2012-10-01 Lawrence Crowl > > * Make-lang.in (JAVA_OBJS): Add dependence on hash-table.o. > (JCFDUMP_OBJS): Add dependence on hash-table.o. > (jcf-io.o): Add dependence on hash-table.h. > * jcf-io.c (memoized_class_lookups): Change to use type-safe hash table. > > Index: gcc/c/ChangeLog > > 2012-10-01 Lawrence Crowl > > * Make-lang.in (c-decl.o): Add dependence on hash-table.h. > * c-decl.c (detect_field_duplicates_hash): Change to new type-safe > hash table. > > Index: gcc/objc/ChangeLog > > 2012-10-01 Lawrence Crowl > > * Make-lang.in (OBJC_OBJS): Add dependence on hash-table.o. > (objc-act.o): Add dependence on hash-table.h. > * objc-act.c (objc_detect_field_duplicates): Change to new type-safe > hash table. > > Index: gcc/ChangeLog > > 2012-10-01 Lawrence Crowl > > * Makefile.in (fold-const.o): Add depencence on hash-table.h. > (dse.o): Likewise. > (cfg.o): Likewise. > * fold-const.c (fold_checksum_tree): Change to new type-safe hash table. > * (print_fold_checksum): Likewise. > * cfg.c (var bb_original): Likewise. > * (var bb_copy): Likewise. > * (var loop_copy): Likewise. > * hash-table.h (template hash_table): Constify parameters for find... > and remove_elt... member functions. > (hash_table::empty) Correct size expression. > (hash_table::clear_slot) Correct deleted entry assignment. > * dse.c (var rtx_group_table): Change to new type-safe hash table. > > Index: gcc/cp/ChangeLog > > 2012-10-01 Lawrence Crowl > > * Make-lang.in (class.o): Add dependence on hash-table.h. > (tree.o): Likewise. > (semantics.o): Likewise. > * class.c (fixed_type_or_null): Change to new type-safe hash table. > * tree.c (verify_stmt_tree): Likewise. > (verify_stmt_tree_r): Likewise. > * semantics.c (struct nrv_data): Likewise. 
> > > Index: gcc/java/Make-lang.in > === > --- gcc/java/Make-lang.in (revision 191941) > +++ gcc/java/Make-lang.in (working copy) > @@ -83,10 +83,10 @@ JAVA_OBJS = java/class.o java/decl.o jav >java/zextract.o java/jcf-io.o java/win32-host.o java/jcf-parse.o > java/mangle.o \ >java/mangle_name.o java/builtins.o java/resource.o \ >java/jcf-depend.o \ > - java/jcf-path.o java/boehm.o java/java-gimplify.o > + java/jcf-path.o java/boehm.o java/java-gimplify.o hash-table.o > > JCFDUMP_OBJS = java/jcf-dump.o java/jcf-io.o java/jcf-depend.o > java/jcf-path.o \ > - java/win32-host.o java/zextract.o ggc-none.o > + java/win32-host.o java/zextract.o ggc-none.o hash-table.o > > JVGENMAIN_OBJS = java/jvgenmain.o java/mangle_name.o > > @@ -326,7 +326,7 @@ java/java-gimplify.o: java/java-gimplify > # jcf-io.o needs $(ZLIBINC) added to cflags. > CFLAGS-java/jcf-io.o += $(ZLIBINC) > java/jcf-io.o: java/jcf-io.c $(CONFIG_H) $(SYSTEM_H) coretypes.h \ > - $(JAVA_TREE_H) java/zipfile.h > + $(JAVA_TREE_H) java/zipfile.h $(HASH_TABLE_H) > > # jcf-path.o needs a -D. > CFLAGS-java/jcf-path.o += \ > Index: gcc/java/jcf-io.c > === > --- gcc/java/jcf-io.c (revision 191941) > +++ gcc/java/jcf-io.c (working copy) > @@ -31,7 +31,7 @@ The Free Software Foundation is independ > #include "jcf.h" > #include "tree.h" > #include "java-tree.h" > -#include "hashtab.h" > +#include "hash-table.h" > #include > > #include "zlib.h" > @@ -271,20 +271,34 @@ find_classfile (char *filename, JCF *jcf >return open_class (filename, jcf, fd, dep_name); > } > > -/* Returns 1 if the CLASSNAME (really a char *) matches the name > - stored in TABLE_ENTRY (also a char *). */ > > -static int > -memoized_class_lookup_eq (const void *table_entry, const void *classname) > +/* Ha
Re: [PATCH] Fix powerpc breakage, was: Add option for dumping to stderr (issue6190057)
On Tue, Oct 2, 2012 at 1:11 AM, Xinliang David Li wrote: > On Mon, Oct 1, 2012 at 4:05 PM, Sharad Singhai wrote: >> Thanks for tracking down and fixing the powerpc port. >> >> The "dump_kind_p ()" check is redundant but canonical form here. I >> think blocks of dump code guarded by "if dump_kind_p (...)" might be >> easier to read/maintain. >> > > I find it confusing to be honest. The redundant check serves no purpose. The check should be inlined and avoid the call to the diagnostic routine, thus speed up compile-time. We should use this pattern, especially if it guards multiple calls. Richard. > David > >> Sharad >> Sharad >> >> >> On Mon, Oct 1, 2012 at 3:45 PM, Xinliang David Li wrote: >>> On Mon, Oct 1, 2012 at 2:37 PM, Michael Meissner >>> wrote: I tracked down some of the other code that previously used REPORT_DETAILS, and MSG_NOTE is the new way to do the same thing. This bootstraps and no unexpected errors occur during make check. Is it ok to install? 2012-10-01 Michael Meissner * config/rs6000/rs6000.c (toplevel): Include dumpfile.h. (rs6000_density_test): Rework to accomidate 09-30 change by Sharad Singhai. * config/rs6000/t-rs6000 (rs6000.o): Add dumpfile.h dependency. Index: gcc/config/rs6000/rs6000.c === --- gcc/config/rs6000/rs6000.c (revision 191932) +++ gcc/config/rs6000/rs6000.c (working copy) @@ -58,6 +58,7 @@ #include "tm-constrs.h" #include "opts.h" #include "tree-vectorizer.h" +#include "dumpfile.h" #if TARGET_XCOFF #include "xcoffout.h" /* get declarations of xcoff_*_section_name */ #endif @@ -3518,11 +3519,11 @@ rs6000_density_test (rs6000_cost_data *d && vec_cost + not_vec_cost > DENSITY_SIZE_THRESHOLD) { data->cost[vect_body] = vec_cost * (100 + DENSITY_PENALTY) / 100; - if (vect_print_dump_info (REPORT_DETAILS)) - fprintf (vect_dump, -"density %d%%, cost %d exceeds threshold, penalizing " -"loop body cost by %d%%", density_pct, -vec_cost + not_vec_cost, DENSITY_PENALTY); + if (dump_kind_p (MSG_NOTE)) >>> >>> Is this check needed? Seems redundant. >>> >>> David >>> >>> + dump_printf_loc (MSG_NOTE, vect_location, +"density %d%%, cost %d exceeds threshold, penalizing " +"loop body cost by %d%%", density_pct, +vec_cost + not_vec_cost, DENSITY_PENALTY); } } Index: gcc/config/rs6000/t-rs6000 === --- gcc/config/rs6000/t-rs6000 (revision 191932) +++ gcc/config/rs6000/t-rs6000 (working copy) @@ -26,7 +26,7 @@ rs6000.o: $(CONFIG_H) $(SYSTEM_H) corety $(OBSTACK_H) $(TREE_H) $(EXPR_H) $(OPTABS_H) except.h function.h \ output.h dbxout.h $(BASIC_BLOCK_H) toplev.h $(GGC_H) $(HASHTAB_H) \ $(TM_P_H) $(TARGET_H) $(TARGET_DEF_H) langhooks.h reload.h gt-rs6000.h \ - cfgloop.h $(OPTS_H) $(COMMON_TARGET_H) + cfgloop.h $(OPTS_H) $(COMMON_TARGET_H) dumpfile.h rs6000-c.o: $(srcdir)/config/rs6000/rs6000-c.c \ $(srcdir)/config/rs6000/rs6000-protos.h \ -- Michael Meissner, IBM 5 Technology Place Drive, M/S 2757, Westford, MA 01886-3141, USA meiss...@linux.vnet.ibm.com fax +1 (978) 399-6899
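For reference, the pattern being recommended reads like this on the rs6000 hunk (a sketch only, not the committed form); the inlined dump_kind_p () guard lets a whole group of dump calls be skipped without ever entering the out-of-line diagnostic routines:

  if (dump_kind_p (MSG_NOTE))
    {
      dump_printf_loc (MSG_NOTE, vect_location,
                       "density %d%%, cost %d exceeds threshold, penalizing "
                       "loop body cost by %d%%", density_pct,
                       vec_cost + not_vec_cost, DENSITY_PENALTY);
      /* Any further MSG_NOTE output about this decision goes inside the
         same guard, paying for the dump_kind_p check only once.  */
    }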
Re: [PATCH] Fix test breakage, was: Add option for dumping to stderr (issue6190057)
On Tue, Oct 2, 2012 at 1:31 AM, Sharad Singhai wrote: > Here is a patch to fix test breakage caused by r191883. Bootstrapped > on x86_64 and tested with > make -k check RUNTESTFLAGS="--target_board=unix/\{,-m32\}". > > Okay for trunk? Ok. Thanks, Richard. > Thanks, > Sharad > > 2012-10-01 Sharad Singhai > > * tree-vect-stmts.c (vectorizable_operation): Add missing return. > > testsuite/Changelog > > * gfortran.dg/vect/vect.exp: Change verbose vectorizor dump options > to fix test failures caused by r191883. > * gcc.dg/tree-ssa/gen-vect-11.c: Likewise. > * gcc.dg/tree-ssa/gen-vect-2.c: Likewise. > * gcc.dg/tree-ssa/gen-vect-32.c: Likewise. > * gcc.dg/tree-ssa/gen-vect-25.c: Likewise. > * gcc.dg/tree-ssa/gen-vect-11a.c: Likewise. > * gcc.dg/tree-ssa/gen-vect-26.c: Likewise. > * gcc.dg/tree-ssa/gen-vect-11b.c: Likewise. > * gcc.dg/tree-ssa/gen-vect-11c.c: Likewise. > * gcc.dg/tree-ssa/gen-vect-28.c: Likewise. > * testsuite/gcc.target/i386/vect-double-1.c: Fix test. Missing entry > from r191883. > > > Index: testsuite/gfortran.dg/vect/vect.exp > === > --- testsuite/gfortran.dg/vect/vect.exp (revision 191883) > +++ testsuite/gfortran.dg/vect/vect.exp (working copy) > @@ -26,7 +26,7 @@ set DEFAULT_VECTCFLAGS "" > > # These flags are used for all targets. > lappend DEFAULT_VECTCFLAGS "-O2" "-ftree-vectorize" "-fno-vect-cost-model" \ > - "-ftree-vectorizer-verbose=4" "-fdump-tree-vect-stats" > + "-fdump-tree-vect-details" > > # If the target system supports vector instructions, the default action > # for a test is 'run', otherwise it's 'compile'. Save current default. > Index: testsuite/gcc.dg/tree-ssa/gen-vect-11.c > === > --- testsuite/gcc.dg/tree-ssa/gen-vect-11.c (revision 191883) > +++ testsuite/gcc.dg/tree-ssa/gen-vect-11.c (working copy) > @@ -1,6 +1,6 @@ > /* { dg-do run { target vect_cmdline_needed } } */ > -/* { dg-options "-O2 -ftree-vectorize -ftree-vectorizer-verbose=3 > -fwrapv -fdump-tree-vect-stats" } */ > -/* { dg-options "-O2 -ftree-vectorize -ftree-vectorizer-verbose=3 > -fwrapv -fdump-tree-vect-stats -mno-sse" { target { i?86-*-* > x86_64-*-* } } } */ > +/* { dg-options "-O2 -ftree-vectorize -fwrapv -fdump-tree-vect-details" } */ > +/* { dg-options "-O2 -ftree-vectorize -fwrapv > -fdump-tree-vect-details -mno-sse" { target { i?86-*-* x86_64-*-* } } > } */ > > #include > > Index: testsuite/gcc.dg/tree-ssa/gen-vect-2.c > === > --- testsuite/gcc.dg/tree-ssa/gen-vect-2.c (revision 191883) > +++ testsuite/gcc.dg/tree-ssa/gen-vect-2.c (working copy) > @@ -1,6 +1,6 @@ > /* { dg-do run { target vect_cmdline_needed } } */ > -/* { dg-options "-O2 -ftree-vectorize -ftree-vectorizer-verbose=4 > -fdump-tree-vect-stats" } */ > -/* { dg-options "-O2 -ftree-vectorize -ftree-vectorizer-verbose=4 > -fdump-tree-vect-stats -mno-sse" { target { i?86-*-* x86_64-*-* } } } > */ > +/* { dg-options "-O2 -ftree-vectorize -fdump-tree-vect-details" } */ > +/* { dg-options "-O2 -ftree-vectorize -fdump-tree-vect-details > -mno-sse" { target { i?86-*-* x86_64-*-* } } } */ > > #include > > Index: testsuite/gcc.dg/tree-ssa/gen-vect-32.c > === > --- testsuite/gcc.dg/tree-ssa/gen-vect-32.c (revision 191883) > +++ testsuite/gcc.dg/tree-ssa/gen-vect-32.c (working copy) > @@ -1,6 +1,6 @@ > /* { dg-do run { target vect_cmdline_needed } } */ > -/* { dg-options "-O2 -ftree-vectorize -ftree-vectorizer-verbose=4 > -fdump-tree-vect-stats" } */ > -/* { dg-options "-O2 -ftree-vectorize -ftree-vectorizer-verbose=4 > -fdump-tree-vect-stats -mno-sse" { target { i?86-*-* x86_64-*-* } } } > */ > +/* { dg-options "-O2 
-ftree-vectorize -fdump-tree-vect-details" } */ > +/* { dg-options "-O2 -ftree-vectorize -fdump-tree-vect-details > -mno-sse" { target { i?86-*-* x86_64-*-* } } } */ > > #include > > Index: testsuite/gcc.dg/tree-ssa/gen-vect-25.c > === > --- testsuite/gcc.dg/tree-ssa/gen-vect-25.c (revision 191883) > +++ testsuite/gcc.dg/tree-ssa/gen-vect-25.c (working copy) > @@ -1,6 +1,6 @@ > /* { dg-do run { target vect_cmdline_needed } } */ > -/* { dg-options "-O2 -ftree-vectorize -ftree-vectorizer-verbose=4 > -fdump-tree-vect-stats" } */ > -/* { dg-options "-O2 -ftree-vectorize -ftree-vectorizer-verbose=4 > -fdump-tree-vect-stats -mno-sse" { target { i?86-*-* x86_64-*-* } } } > */ > +/* { dg-options "-O2 -ftree-vectorize -fdump-tree-vect-details" } */ > +/* { dg-options "-O2 -ftree-vectorize -fdump-tree-vect-details > -mno-sse" { target { i?86-*-* x86_64-*-* } } } */ > > #include > > Index: testsuite/gcc.dg/tree-ssa/gen-vect-11a.c > ===
Re: [PATCH] Add option for dumping to stderr (issue6190057)
On Mon, Oct 1, 2012 at 8:39 PM, Gabriel Dos Reis wrote: > On Mon, Oct 1, 2012 at 1:27 PM, Michael Meissner > wrote: >> On Mon, Oct 01, 2012 at 02:02:26PM -0400, Michael Meissner wrote: >>> Your change on September 30th, breaks the powerpc port because the >>> REPORT_DETAILS value in the enumeration is no longer there, and the >>> rs6000_density_test function was using that. Please in the future, when you >>> are making global changes, grep for uses of enum values in all of the >>> machine >>> dependent directories so we can avoid breakage like this. >> >> Also, in looking at the changes, given we are already up to 28 TDF_ flags, I >> would recommend immediately adding a new type that is the TDF flagword type. >> Thus it will be a lot simpler when we add 4 more TDF flags and have to change >> the type from int to HOST_WIDE_INT. > > Agreed that we need an abstraction here. Some TLC as well - the flags have various meanings (some control dumping, some, like TDF_TREE, seem to be unrelated - the MSG ones probably don't need the same number-space as well, not all flags are used anymore - TDF_MEMSYMS?). But yes, an abstraction is needed. But I wouldn't suggest HOST_WIDE_INT but int -> uint32_t instead (possibly going uint64_t). Richard. > -- Gaby
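A rough sketch of the kind of abstraction being asked for (all names and bit assignments below are illustrative, not a concrete interface proposal): a dedicated flag-word typedef means that running out of bits touches one type instead of every "int flags" parameter in the dumping API, and the MSG_* kinds can move out of the TDF_* number-space entirely.

#include <stdint.h>

/* Illustrative only: one type for the dump-control flag word ...  */
typedef uint64_t dump_flags_t;

#define TDF_SLIM    ((dump_flags_t) 1 << 0)
#define TDF_DETAILS ((dump_flags_t) 1 << 1)

/* ... and a separate small enum for the optimization-report kinds, so
   they no longer compete with the TDF_* bits.  */
enum dump_msg_kind { DUMP_MSG_NOTE, DUMP_MSG_MISSED_OPT, DUMP_MSG_OPTIMIZED };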
Re: [Patch] Fix PR53397
On Mon, 1 Oct 2012, venkataramanan.ku...@amd.com wrote: > Hi, > > The below patch fixes the FFT/Scimark regression caused by useless prefetch > generation. > > This fix tries to make prefetch less aggressive by prefetching arrays in the > inner loop, when the step is invariant in the entire loop nest. > > GCC currently tries to prefetch invariant steps when they are in the inner > loop. But does not check if the step is variant in outer loops. > > In the scimark FFT case, the trip count of the inner loop varies by a non > constant step, which is invariant in the inner loop. > But the step variable is varying in outer loop. This makes > inner loop trip count small (at run time varies sometimes as small as 1 > iteration) > > Prefetching ahead x iteration when the inner loop trip count is smaller than x > leads to useless prefetches. > > Flag used: -O3 -march=amdfam10 > > Before > ** ** > ** SciMark2 Numeric Benchmark, see http://math.nist.gov/scimark ** > ** for details. (Results can be submitted to p...@nist.gov) ** > ** ** > Using 2.00 seconds min time per kenel. > Composite Score: 550.50 > FFT Mflops:38.66(N=1024) > SOR Mflops: 617.61(100 x 100) > MonteCarlo: Mflops: 173.74 > Sparse matmult Mflops: 675.63(N=1000, nz=5000) > LU Mflops: 1246.88(M=100, N=100) > > > After > ** ** > ** SciMark2 Numeric Benchmark, see http://math.nist.gov/scimark ** > ** for details. (Results can be submitted to p...@nist.gov) ** > ** ** > Using 2.00 seconds min time per kenel. > Composite Score: 639.20 > FFT Mflops: 479.19(N=1024) > SOR Mflops: 617.61(100 x 100) > MonteCarlo: Mflops: 173.18 > Sparse matmult Mflops: 679.13(N=1000, nz=5000) > LU Mflops: 1246.88(M=100, N=100) > > GCC regression "make check -k" passes with x86_64-unknown-linux-gnu > New tests that PASS: > > gcc.dg/pr53397-1.c scan-assembler prefetcht0 > gcc.dg/pr53397-1.c scan-tree-dump aprefetch "Issued prefetch" > gcc.dg/pr53397-1.c (test for excess errors) > gcc.dg/pr53397-2.c scan-tree-dump aprefetch "loop variant step" > gcc.dg/pr53397-2.c scan-tree-dump aprefetch "Not prefetching" > gcc.dg/pr53397-2.c (test for excess errors) > > > Checked CPU2006 and polyhedron on latest AMD processor, no regressions noted. > > Ok to commit in trunk? > > regards, > Venkat > > gcc/ChangeLog > +2012-10-01 Venkataramanan Kumar > + > + * tree-ssa-loop-prefetch.c (gather_memory_references_ref):$ > + Perform non constant step prefetching in inner loop, only $ > + when it is invariant in the entire loop nest. $ > + * testsuite/gcc.dg/pr53397-1.c: New test case $ > + Checks we are prefecthing for loop invariant steps$ > + * testsuite/gcc.dg/pr53397-2.c: New test case$ > + Checks we are not prefecthing for loop variant steps > + > > > Index: gcc/testsuite/gcc.dg/pr53397-1.c > === > --- gcc/testsuite/gcc.dg/pr53397-1.c (revision 0) > +++ gcc/testsuite/gcc.dg/pr53397-1.c (revision 0) > @@ -0,0 +1,28 @@ > +/* Prefetching when the step is loop invariant. 
*/ > + > +/* { dg-do compile } */ > +/* { dg-options "-O3 -fprefetch-loop-arrays -fdump-tree-aprefetch-details > --param min-insn-to-prefetch-ratio=3 --param simultaneous-prefetches=10 > -fdump-tree-aprefetch-details" } */ > + > + > +double data[16384]; > +void prefetch_when_non_constant_step_is_invariant(int step, int n) > +{ > + int a; > + int b; > + for (a = 1; a < step; a++) { > +for (b = 0; b < n; b += 2 * step) { > + > + int i = 2*(b + a); > + int j = 2*(b + a + step); > + > + > + data[j] = data[i]; > + data[j+1] = data[i+1]; > +} > + } > +} > + > +/* { dg-final { scan-tree-dump "Issued prefetch" "aprefetch" } } */ > +/* { dg-final { scan-assembler "prefetcht0" } } */ This (and the case below) needs to be adjusted to only run on the appropriate hardware. See for example gcc.dg/tree-ssa/prefetch-8.c for how to do this. > +/* { dg-final { cleanup-tree-dump "aprefetch" } } */ > Index: gcc/testsuite/gcc.dg/pr53397-2.c > === > --- gcc/testsuite/gcc.dg/pr53397-2.c (revision 0) > +++ gcc/testsuite/gcc.dg/pr53397-2.c (revision 0) > @@ -0,0 +1,29 @@ > +/* Not prefetching when the step is loop variant. */ > + > +/* { dg-do compile } */ > +/* { dg-options "-O3 -fprefetch-loop-arrays -fdump-tree-aprefetch-details > --param min-insn-to-prefetch-ratio=3 --param simultaneous-prefetches=10 > -fdump-tree-aprefetch-details" } */ > + > + > +double data[16384]; > +void donot_prefetch_when_n
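For illustration only (this is not the reduced testcase from the patch): the shape of loop nest the fix targets looks like the following, where the step is invariant inside the inner loop but changes across outer iterations, so the inner trip count can collapse to a handful of iterations and a fixed prefetch look-ahead overshoots the loop.

void
fft_like_sketch (double *data, int n)
{
  for (int step = 1; step < n; step *= 2)      /* step varies in the outer loop */
    for (int b = 0; b + step < n; b += step)   /* ... but is inner-loop invariant;
                                                  trip count ~ n/step shrinks as
                                                  step grows, so prefetching
                                                  several iterations ahead is
                                                  wasted work for large step.  */
      data[b] += data[b + step];
}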
Re: vec_cond_expr adjustments
On Mon, Oct 1, 2012 at 5:57 PM, Marc Glisse wrote: > [merging both threads, thanks for the answers] > > > On Mon, 1 Oct 2012, Richard Guenther wrote: > >>>> optabs should be fixed instead, an is_gimple_val condition is >>>> implicitely >>>> val != 0. >>> >>> >>> For vectors, I think it should be val < 0 (with an appropriate cast of >>> val >>> to a signed integer vector type if necessary). Or (val & highbit) != 0, >>> but >>> that's longer. >> >> >> I don't think so. Throughout the compiler we generally assume false == 0 >> and anything else is true. (yes, for FP there is STORE_FLAG_VALUE, but >> it's scope is quite limited - if we want sth similar for vectors we'd have >> to >> invent it). > > > See below. > > >>>> If we for example have >>>> >>>> predicate = a < b; >>>> x = predicate ? d : e; >>>> y = predicate ? f : g; >>>> >>>> we ideally want to re-use the predicate computation on targets where >>>> that would be optimal (and combine should be able to recover the >>>> case where it is not). >>> >>> >>> That I don't understand. The vcond instruction implemented by targets >>> takes >>> as arguments d, e, cmp, a, b and emits the comparison itself. I don't see >>> how I can avoid sending to the targets both (d,e,<,a,b) and (f,g,<,a,b). >>> They will notice eventually that a>> the >>> two, but I don't see how to do that in optabs.c. Or I can compute x = a < >>> b, >>> use x < 0 as the comparison passed to the targets, and expect targets >>> (those >>> for which it is true) to recognize that < 0 is useless in a vector >>> condition >>> (PR54700), or is useless on a comparison result. >> >> >> But that's a limitation of how vcond works. ISTR there is/was a vselect >> instruction as well, taking a "mask" and two vectors to select from. At >> least >> that's how vcond works internally for some sub-targets. > > > vselect seems to only appear in config/. Would it be defined as: > vselect(m,a,b)=(a&m)|(b&~m) ? I would almost be tempted to just define a > pattern in .md files and let combine handle it, although it might be one > instruction too long for that (and if m is x=y). > Or would it match the OpenCL select: "For each component of a vector type, > result[i] = if MSB of c[i] is set ? b[i] : a[i]."? Or the pattern with & > and | but with a precondition that the value of each element of the mask > must be 0 or ±1? > > I don't find vcond that bad, as long as targets check for trivial > comparisons in the expansion (what trivial means may depend on the > platform). It is quite flexible for targets. Well, ok. > > On Mon, 1 Oct 2012, Richard Guenther wrote: > >>> tmp = fold_build2_loc (gimple_location (def_stmt), >>>code, >>> - boolean_type_node, >>> + TREE_TYPE (cond), >> >> >> That's obvious. > > > Ok, I'll test and commit that line separately. > >>> + if (TREE_CODE (op0) == VECTOR_CST && TREE_CODE (op1) == VECTOR_CST) >>> +{ >>> + int count = VECTOR_CST_NELTS (op0); >>> + tree *elts = XALLOCAVEC (tree, count); >>> + gcc_assert (TREE_CODE (type) == VECTOR_TYPE); >>> + >>> + for (int i = 0; i < count; i++) >>> + { >>> + tree elem_type = TREE_TYPE (type); >>> + tree elem0 = VECTOR_CST_ELT (op0, i); >>> + tree elem1 = VECTOR_CST_ELT (op1, i); >>> + >>> + elts[i] = fold_relational_const (code, elem_type, >>> + elem0, elem1); >>> + >>> + if(elts[i] == NULL_TREE) >>> + return NULL_TREE; >>> + >>> + elts[i] = fold_negate_const (elts[i], elem_type); >> >> >> I think you need to invent something new similar to STORE_FLAG_VALUE >> or use STORE_FLAG_VALUE here. 
With the above you try to map >> {0, 1} to {0, -1} which is only true if the operation on the element types >> returns {0, 1} (thus, STORE_FLAG_VALUE is 1). > > > Er, seems to me that constant folding of a scalar comparison in the > front/middle-end only returns {0, 1}. The point is we need to define some semantics for vector comparison
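For concreteness, a scalar-loop sketch of the two selection semantics under discussion (purely illustrative; element type and width are arbitrary): the bitwise form only acts as a selection when every mask element is all-zeros or all-ones, while the OpenCL-style form inspects only the sign bit of each mask element.

/* vselect as (a & m) | (b & ~m); m must be 0 or ~0 per element.  */
static void
vselect_bitwise (int *r, const int *a, const int *b, const int *m, int n)
{
  for (int i = 0; i < n; i++)
    r[i] = (a[i] & m[i]) | (b[i] & ~m[i]);
}

/* OpenCL-style select: result[i] = MSB of m[i] set ? b[i] : a[i].  */
static void
vselect_msb (int *r, const int *a, const int *b, const int *m, int n)
{
  for (int i = 0; i < n; i++)
    r[i] = m[i] < 0 ? b[i] : a[i];
}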
Re: [PATCH] Vector CONSTRUCTOR verifier
On Tue, Oct 2, 2012 at 3:01 PM, Jakub Jelinek wrote: > Hi! > > As discussed in the PR and on IRC, this patch verifies that vector > CONSTRUCTOR in GIMPLE is either empty CONSTRUCTOR, or contains scalar > elements of type compatible with vector element type (then the verification > is less strict, allows less than TYPE_VECTOR_SUBPARTS elements and allows > non-NULL indexes if they are consecutive (no holes); this is because > from FEs often CONSTRUCTORs with those properties leak into the IL, and > a change in the gimplifier to canonicalize them wasn't enough, they keep > leaking even from non-gimplified DECL_INITIAL values etc.), or > contains vector elements (element types must be compatible, the vector > elements must be of the same type and their number must fill the whole > wider vector - these are created/used by tree-vect-generic lowering if > HW supports only shorter vectors than what is requested in source). > > Bootstrapped/regtested on x86_64-linux and i686-linux, ok for trunk? Ok with ... > 2012-10-02 Jakub Jelinek > > PR tree-optimization/54713 > * expr.c (categorize_ctor_elements_1): Don't assume purpose is > non-NULL. > * tree-cfg.c (verify_gimple_assign_single): Add verification of > vector CONSTRUCTORs. > * tree-ssa-sccvn.c (vn_reference_lookup_3): For VECTOR_TYPE > CONSTRUCTORs, don't do anything if element type is VECTOR_TYPE, > and don't check index. > * tree-vect-slp.c (vect_get_constant_vectors): VIEW_CONVERT_EXPR > ctor elements first if their type isn't compatible with vector > element type. > > --- gcc/expr.c.jj 2012-09-27 12:45:53.0 +0200 > +++ gcc/expr.c 2012-10-01 18:21:40.885122833 +0200 > @@ -5491,7 +5491,7 @@ categorize_ctor_elements_1 (const_tree c > { >HOST_WIDE_INT mult = 1; > > - if (TREE_CODE (purpose) == RANGE_EXPR) > + if (purpose && TREE_CODE (purpose) == RANGE_EXPR) > { > tree lo_index = TREE_OPERAND (purpose, 0); > tree hi_index = TREE_OPERAND (purpose, 1); > --- gcc/tree-cfg.c.jj 2012-10-01 17:28:17.469921927 +0200 > +++ gcc/tree-cfg.c 2012-10-02 11:24:11.686155889 +0200 > @@ -4000,6 +4000,80 @@ verify_gimple_assign_single (gimple stmt >return res; > > case CONSTRUCTOR: > + if (TREE_CODE (rhs1_type) == VECTOR_TYPE) > + { > + unsigned int i; > + tree elt_i, elt_v, elt_t = NULL_TREE; > + > + if (CONSTRUCTOR_NELTS (rhs1) == 0) > + return res; > + /* For vector CONSTRUCTORs we require that either it is empty > +CONSTRUCTOR, or it is a CONSTRUCTOR of smaller vector elements > +(then the element count must be correct to cover the whole > +outer vector and index must be NULL on all elements, or it is > +a CONSTRUCTOR of scalar elements, where we as an exception allow > +smaller number of elements (assuming zero filling) and > +consecutive indexes as compared to NULL indexes (such > +CONSTRUCTORs can appear in the IL from FEs). 
*/ > + FOR_EACH_CONSTRUCTOR_ELT (CONSTRUCTOR_ELTS (rhs1), i, elt_i, elt_v) > + { > + if (elt_t == NULL_TREE) > + { > + elt_t = TREE_TYPE (elt_v); > + if (TREE_CODE (elt_t) == VECTOR_TYPE) > + { > + tree elt_t = TREE_TYPE (elt_v); > + if (!useless_type_conversion_p (TREE_TYPE (rhs1_type), > + TREE_TYPE (elt_t))) > + { > + error ("incorrect type of vector CONSTRUCTOR" > +" elements"); > + debug_generic_stmt (rhs1); > + return true; > + } > + else if (CONSTRUCTOR_NELTS (rhs1) > + * TYPE_VECTOR_SUBPARTS (elt_t) > + != TYPE_VECTOR_SUBPARTS (rhs1_type)) > + { > + error ("incorrect number of vector CONSTRUCTOR" > +" elements"); > + debug_generic_stmt (rhs1); > + return true; > + } > + } > + else if (!useless_type_conversion_p (TREE_TYPE (rhs1_type), > + elt_t)) > + { > + error ("incorrect type of vector CONSTRUCTOR elements"); > + debug_generic_stmt (rhs1); > + return true; > + } > + else if (CONSTRUCTOR_NELTS (rhs1) > + > TYPE_VECTOR_SUBPARTS (rhs1_type)) > + { > + error ("incorrect number of vector CONSTRUCTOR > elements"); > +
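As a reader's aid, the three CONSTRUCTOR shapes the verifier accepts, written with the GCC vector extension as rough source-level analogues (illustrative only; the vector-of-vectors form has no direct C syntax and normally only appears internally after tree-vect-generic lowering):

typedef int v4si __attribute__ ((vector_size (16)));

v4si ok_empty   = { };        /* empty CONSTRUCTOR                           */
v4si ok_scalars = { 1, 2 };   /* scalar elements of compatible type; fewer
                                 than TYPE_VECTOR_SUBPARTS elements and
                                 consecutive (hole-free) indexes are allowed */
/* Third form: a CONSTRUCTOR whose elements are themselves vectors, e.g.
   two V2SI halves exactly filling a V4SI; element types must be
   compatible and the pieces must cover the whole wider vector.  */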
[PATCH] Fix PR54735
This fixes PR54735 - a bad interaction of non-up-to-date virtual SSA form, update-SSA and cfg cleanup. Morale of the story: cfg cleanup can remove blocks and thus release SSA names - SSA update is rightfully confused when such released SSA name is still used at update time. The following patch fixes the case in question simply by making sure to run update-SSA before cfg cleanup. [eventually release_ssa_name could treat virtual operands the same as regular SSA names when they are scheduled for a re-write, that would make this whole mess more robust - I am thinking of this] Bootstrapped on x86_64-unknown-linux-gnu, testing in progress. Richard. 2012-10-02 Richard Guenther PR middle-end/54735 * tree-ssa-pre.c (do_pre): Make sure to update virtual SSA form before cleaning up the CFG. * g++.dg/torture/pr54735.C: New testcase. Index: gcc/tree-ssa-pre.c === --- gcc/tree-ssa-pre.c (revision 191969) +++ gcc/tree-ssa-pre.c (working copy) @@ -4820,6 +4820,13 @@ do_pre (void) free_scc_vn (); + /* Tail merging invalidates the virtual SSA web, together with + cfg-cleanup opportunities exposed by PRE this will wreck the + SSA updating machinery. So make sure to run update-ssa + manually, before eventually scheduling cfg-cleanup as part of + the todo. */ + update_ssa (TODO_update_ssa_only_virtuals); + return todo; } @@ -4845,8 +4852,7 @@ struct gimple_opt_pass pass_pre = 0, /* properties_provided */ 0, /* properties_destroyed */ TODO_rebuild_alias, /* todo_flags_start */ - TODO_update_ssa_only_virtuals | TODO_ggc_collect - | TODO_verify_ssa /* todo_flags_finish */ + TODO_ggc_collect | TODO_verify_ssa /* todo_flags_finish */ } }; Index: gcc/testsuite/g++.dg/torture/pr54735.C === --- gcc/testsuite/g++.dg/torture/pr54735.C (revision 0) +++ gcc/testsuite/g++.dg/torture/pr54735.C (working copy) @@ -0,0 +1,179 @@ +// { dg-do compile } + +class Gmpfr +{}; +class M : Gmpfr +{ +public: + Gmpfr infconst; + M(int); +}; +templatestruct A; +templateclass N; +templateclass O; +templatestruct B; +struct C +{ + enum + { value }; +}; +class D +{ +public: + enum + { ret }; +}; +struct F +{ + enum + { ret = 0 ? : 0 }; +}; +templatestruct G +{ + typedef Otype; +}; +struct H +{ + void operator * (); +}; +struct I +{ + enum + { RequireInitialization = C::value ? : 0, ReadCost }; +}; +templatestruct J +{ + enum + { ret = A::InnerStrideAtCompileTime }; +}; +templatestruct K +{ + enum + { ret = A::OuterStrideAtCompileTime }; +}; +templateclass P : H +{ +public: + using H::operator *; + typedef typename A::Scalar Scalar; + enum + { RowsAtCompileTime= + A::RowsAtCompileTime, ColsAtCompileTime = + A::ColsAtCompileTime, SizeAtCompileTime = + F::ret, MaxRowsAtCompileTime = + A::MaxRowsAtCompileTime, MaxColsAtCompileTime = + A::MaxColsAtCompileTime, MaxSizeAtCompileTime = + F::ret, Flags = + A::Flags ? : 0 ? 
: 0, CoeffReadCost = + A::CoeffReadCost, InnerStrideAtCompileTime= + J::ret, OuterStrideAtCompileTime = K::ret }; + B operator << (const Scalar&); +}; + +templateclass O : public P +{}; + +templateclass L +{ +public: + + int cols() + { +return _Cols; + } +}; +templateclass Q : public G::type +{ +public: + typedef typename G::type Base; + typedef typename A::Index Index; + typedef typename A::Scalar Scalar; + L m_storage; + Index cols() + { +return m_storage.cols(); + } + + Scalar& coeffRef(Index, + Index); +}; + +templatestruct A > +{ + typedef _Scalar Scalar; + typedef int Index; + enum + { RowsAtCompileTime, ColsAtCompileTime = + _Cols, MaxRowsAtCompileTime, MaxColsAtCompileTime, Flags= + D::ret, CoeffReadCost = + I::ReadCost, InnerStrideAtCompileTime, OuterStrideAtCompileTime = + 0 ? : 0 }; +}; +templateclass N : public Q > +{ +public: + Q Base; + templateN(const T0&, + const T1&); +}; +void +__assert_fail(int) +throw() __attribute__((__noreturn__)); +templatestruct B +{ + typedef typename XprType::Scalar Scalar; + typedef typename XprType::Index Index; + B(XprType & p1, const Scalar &) : m_xpr(p1), m_col(), +m_currentBlockRows(1) + {} B& operator, (const Scalar&) + { +Index a; + +if (m_col == m_xpr.cols()) +{ + m_col = 0; + m_currentBlockRows = 1; + a && "Too " ? static_cast(0) : _
Re: Convert more non-GTY htab_t to hash_table.
On Tue, 2 Oct 2012, Lawrence Crowl wrote: > On 10/2/12, Richard Guenther wrote: > > On Mon, 1 Oct 2012, Lawrence Crowl wrote: > > > Change more non-GTY hash tables to use the new type-safe > > > template hash table. Constify member function parameters that > > > can be const. Correct a couple of expressions in formerly > > > uninstantiated templates. > > > > > > The new code is 0.362% faster in bootstrap, with a 99.5% > > > confidence of being faster. > > > > > > Tested on x86-64. > > > > > > Okay for trunk? > > > > You are changing a hashtable used by fold checking, did you test > > with fold checking enabled? > > I didn't know I had to do anything beyond the normal make check. > What do I do? > > > +/* Data structures used to maintain mapping between basic blocks and > > + copies. */ > > +static hash_table bb_original; > > +static hash_table bb_copy; > > > > note that because hash_table has a constructor we now get global > > CTORs for all statics :( (and mx-protected local inits ...) > > The overhead for the global constructors isn't significant. > Only the function-local statics have mx-protection, and that can > be eliminated by making them global static. > > > Can you please try to remove the constructor from hash_table to > > avoid this overhead? (as a followup - that is, don't initialize > > htab) > > The initialization avoids potential errors in calling dispose. > I can do it, but I don't think the overhead (after moving the > function-local statics to global) will matter, and so I prefer to > keep the safety. So is the move of the statics sufficient or do > you still want to remove constructors? Hm, having them in-scope where they are used is good style. Why can't they be statically initialized and put in .data? Please make it so - you know C++ enough (ISTR value-initialization is default - which means NULL for the pointer?) Richard. > > The cfg.c, dse.c and hash-table.h parts are ok for trunk, I'll > > leave the rest to respective maintainers of the pieces of the > > compiler. > > > > Thanks, > > Richard. > > > >> > >> Index: gcc/java/ChangeLog > >> > >> 2012-10-01 Lawrence Crowl > >> > >>* Make-lang.in (JAVA_OBJS): Add dependence on hash-table.o. > >>(JCFDUMP_OBJS): Add dependence on hash-table.o. > >>(jcf-io.o): Add dependence on hash-table.h. > >>* jcf-io.c (memoized_class_lookups): Change to use type-safe hash table. > >> > >> Index: gcc/c/ChangeLog > >> > >> 2012-10-01 Lawrence Crowl > >> > >>* Make-lang.in (c-decl.o): Add dependence on hash-table.h. > >>* c-decl.c (detect_field_duplicates_hash): Change to new type-safe > >>hash table. > >> > >> Index: gcc/objc/ChangeLog > >> > >> 2012-10-01 Lawrence Crowl > >> > >>* Make-lang.in (OBJC_OBJS): Add dependence on hash-table.o. > >>(objc-act.o): Add dependence on hash-table.h. > >>* objc-act.c (objc_detect_field_duplicates): Change to new type-safe > >>hash table. > >> > >> Index: gcc/ChangeLog > >> > >> 2012-10-01 Lawrence Crowl > >> > >>* Makefile.in (fold-const.o): Add depencence on hash-table.h. > >>(dse.o): Likewise. > >>(cfg.o): Likewise. > >>* fold-const.c (fold_checksum_tree): Change to new type-safe hash table. > >>* (print_fold_checksum): Likewise. > >>* cfg.c (var bb_original): Likewise. > >>* (var bb_copy): Likewise. > >>* (var loop_copy): Likewise. > >>* hash-table.h (template hash_table): Constify parameters for find... > >>and remove_elt... member functions. > >> (hash_table::empty) Correct size expression. > >> (hash_table::clear_slot) Correct deleted entry assignment. 
> >>* dse.c (var rtx_group_table): Change to new type-safe hash table. > >> > >> Index: gcc/cp/ChangeLog > >> > >> 2012-10-01 Lawrence Crowl > >> > >>* Make-lang.in (class.o): Add dependence on hash-table.h. > >>(tree.o): Likewise. > >>(semantics.o): Likewise. > >>* class.c (fixed_type_or_null): Change to new type-safe hash table. > >>* tree.c (verify_stmt_tree): Likewise. > >>(verify_stmt_tree_r): Likewise. > >>* semantics.c (struct nrv_data): Likewise. > >> > >> > >> Index:
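To spell out the constructor concern with a tiny example (illustrative, not the hash_table code itself): a user-provided constructor generally forces a dynamic initializer for a namespace-scope static (a global CTOR, and guarded initialization for function-local statics), whereas a plain aggregate is zero-initialized statically and can live in .bss.

struct table_with_ctor
{
  void **entries;
  table_with_ctor () : entries (0) {}   /* needs a dynamic initializer       */
};

struct table_pod
{
  void **entries;                       /* no ctor: value-init is all zeros  */
};

static table_with_ctor t1;   /* emitted via a global constructor             */
static table_pod t2;         /* statically zero-initialized, no ctor needed  */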
Re: [PATCH] Fix up DW_TAG_formal_parameter placement
On Wed, 3 Oct 2012, Jakub Jelinek wrote: > Hi! > > With the PR54519 patch I've just posted, I've noticed, I've noticed on the > same testcase from yesterday's IRC: > static inline void foo (int x, int y) { asm volatile ("nop"); } > static inline void bar (int z) { foo (z, 0); foo (z, 1); } > int main () > { > bar (0); > bar (1); > return 0; > } > that while I can print x and y just fine, if I do bt, x, y and z printed > in the backtrace are all optimized out. > The problem is that first tree versioning for foo.isra.* or bar.isra.* > deposits the optimized away parameters as VAR_DECLs into the DECL_INITIAL > block (which is fine), but then during inlining they end up in the remapped > block of DECL_INITIAL, not the new block added above it into which inliner > puts parameters. So in the debug info we have > DW_TAG_inlined_subroutine > DW_TAG_formal_parameter for non-optimized away parameters > DW_TAG_lexical_block > DW_TAG_formal_parameter for optimized away parameters > and the debugger (expectably) looks only at DW_TAG_inlined_subroutine > DIE's immediate children for the formal parameters to print during > backtrace. > Fixed up by moving the VAR_DECLs for parameters optimized away by versioning > to BLOCK_SUPERCONTEXT during inlining, at that point we know both of the > blocks have the same scope, and if the original DECL_INITIAL doesn't contain > any other vars, we can actually end up with shorter/more correct debug info > as well as memory savings due to being able to GC the remapped DECL_INITIAL > block. > > Bootstrapped/regtested on x86_64-linux and i686-linux, ok for trunk? This looks like the wrong place to fix things to me ... either we can fix this at the point we create the VAR_DECLs for the optimized away PARM_DECLs (or we should delay that until here?) or we fix it up in dwarf2out.c (how does this fix interact with stabs and the other debuginfo formats? mentioning DWARF in tree-inline looks odd, unless we get rid of the other formats - something I'd of course welcome ;)) Thanks, Richard. > 2012-10-03 Jakub Jelinek > > * tree-inline.c (expand_call_inline): Move VAR_DECLs with > PARM_DECL origins from remapped DECL_INITIAL's BLOCK_VARS > into id->block's BLOCK_VARS. > > --- gcc/tree-inline.c.jj 2012-10-02 17:43:13.0 +0200 > +++ gcc/tree-inline.c 2012-10-02 19:43:52.576382413 +0200 > @@ -3946,7 +3946,29 @@ expand_call_inline (basic_block bb, gimp >initialize_inlined_parameters (id, stmt, fn, bb); > >if (DECL_INITIAL (fn)) > -prepend_lexical_block (id->block, remap_blocks (DECL_INITIAL (fn), id)); > +{ > + tree *var; > + > + prepend_lexical_block (id->block, remap_blocks (DECL_INITIAL (fn), > id)); > + gcc_checking_assert (BLOCK_SUBBLOCKS (id->block) > +&& (BLOCK_CHAIN (BLOCK_SUBBLOCKS (id->block)) > +== NULL_TREE)); > + /* Move vars for PARM_DECLs from DECL_INITIAL block to id->block, > + otherwise DW_TAG_formal_parameter will not be children of > + DW_TAG_inlined_subroutine, but of a DW_TAG_lexical_block > + under it. The parameters can be then evaluated in the debugger, > + but don't show in backtraces. */ > + for (var = &BLOCK_VARS (BLOCK_SUBBLOCKS (id->block)); *var; ) > + if (TREE_CODE (DECL_ORIGIN (*var)) == PARM_DECL) > + { > + tree v = *var; > + *var = TREE_CHAIN (v); > + TREE_CHAIN (v) = BLOCK_VARS (id->block); > + BLOCK_VARS (id->block) = v; > + } > + else > + var = &TREE_CHAIN (*var); > +} > >/* Return statements in the function body will be replaced by jumps > to the RET_LABEL. 
*/ > > > Jakub > > -- Richard Biener SUSE / SUSE Labs SUSE LINUX Products GmbH - Nuernberg - AG Nuernberg - HRB 16746 GF: Jeff Hawn, Jennifer Guild, Felix Imend
Re: RFC: Using DECL_NO_LIMIT_STACK as a backend specific flag
On Wed, Oct 3, 2012 at 5:20 PM, nick clifton wrote: > Hi Ian, > > >> Can't you just keep a list of the decls for which you have issued the >> warning? > > > Yes - that would work too. In fact I agree that this would be cleaner > solution in my particular case. I'll create a new patch... > > > >> How many decls do you expect to be on that list? > > > Not very many. Maybe two or three at most. > > But I am interested to know if there is a way for targets to add their own > information to function decls (and decls in general). If not for this > particular case, then for problems to come in the future. To the decl not. But I suppose you are only looking at defined functions, so it should suffice to amend struct function. It's reasonable to allow the target to associate extra info with struct function and we already have a way to do that via the init_machine_status target hook. Richard. > Cheers > Nick > >
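A minimal sketch of that suggestion (the field and function names are invented for illustration; each port defines its own machine_function): the target hangs its per-function data off cfun->machine by providing init_machine_status.

/* Target-private, per-function data; gengtype emits the allocator.  */
struct GTY(()) machine_function
{
  unsigned warned_about_frame : 1;   /* e.g. "warning already issued here" */
};

static struct machine_function *
example_init_machine_status (void)
{
  return ggc_alloc_cleared_machine_function ();
}

/* In the target's option-override code:
     init_machine_status = example_init_machine_status;
   after which the flag is reachable as cfun->machine->warned_about_frame.  */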
Re: RFA: add lock_length attribute to break branch-shortening cycles
On Wed, Oct 3, 2012 at 8:22 PM, Joern Rennecke wrote: > The ARCompact architecture has some pipelining features that result in > the vanilla branch shortening not always converging. > > Moreover, there are some short, complex branch instructions with very short > offsets; replacing them with a multi-insn sequence when the offset doesn't > reach makes the code significantly longer. Thus, when starting branch > shortening with pessimistic assumptions, the short branches are often not > used because of the pessimistic branch length causing the offsets going out > of range. > This problem can be avoided when starting with a low estimate and working > upwards. However, that makes the incidence of infinite branch shortening > cycles higher, and also makes it impossible to just break out after some > iteration count. > > To address these issues, I've made the generator programs recognize the > optional lock_length attribute. > > To quote from the documentation added for this feature: > > If you define the `lock_length' attribute, branch shortening will work > the other way round: it starts out assuming minimum instruction lengths > and iterates from there. In addition, the value of the `lock_length' > attribute does not decrease across iterations, and the value computed > for the `length' attribute will be no smaller than that of the > `lock_length' attribute. I miss a few things in this description: - what is the value of lock_length supposed to be? From the "lock" prefix it sounds like it is something unchanging, maybe even constant, thus a maximum? - the length attribute still needs to be specified when lock_length is? how do they relate? Is lock_length always smaller / bigger than length? - what happens if you have patterns with lock_length and patterns without? - what patterns does lock_length apply to? In general optimistically attacking this kind of problem should be always better - did you try simply switching this for all targets? It shouldn't be slower and the only thing you need to guarantee is that during iteration you never make insn-lenghts smaller again. Richard. > bootstrapped and regression tested on i686-pc-linux-gnu > > 2012-10-03 Joern Rennecke > > * final.c (get_attr_length_1): Use direct recursion rather than > calling get_attr_length. > (get_attr_lock_length): New function. > (INSN_VARIABLE_LENGTH_P): Define. > (shorten_branches): Take HAVE_ATTR_lock_length into account. > Don't overwrite non-delay slot insn lengths with the lengths of > delay slot insns with same uid. > * genattrtab.c (lock_length_str): New variable. > (make_length_attrs): New parameter base. > (main): Initialize lock_length_str. > Generate lock_lengths attributes. > * genattr.c (gen_attr): Emit declarations for lock_length attribute > related functions. > * doc/md.texi (node Insn Lengths): Document lock_length attribute. > > Index: doc/md.texi > === > --- doc/md.texi (revision 192036) > +++ doc/md.texi (working copy) > @@ -8004,6 +8004,20 @@ (define_insn "jump" >(const_int 6)))]) > @end smallexample > > +@cindex lock_length > +Usually, branch shortening is done assuming the worst case (i.e. longest) > +lengths, and then iterating (if optimizing) to smaller lengths till > +no further changed occur. This does not work so well for architectures > +that have very small minimum offsets and considerable jumps in instruction > +lengths. 
> + > +If you define the @code{lock_length} attribute, branch shortening will > +work the other way round: it starts out assuming minimum instruction > +lengths and iterates from there. In addition, the value of the > +@code{lock_length} attribute does not decrease across iterations, and > +the value computed for the @code{length} attribute will be no smaller > +than that of the @code{lock_length} attribute. > + > @end ifset > @ifset INTERNALS > @node Constant Attributes > Index: final.c > === > --- final.c (revision 192036) > +++ final.c (working copy) > @@ -312,6 +312,7 @@ dbr_sequence_length (void) > `insn_current_length'. */ > > static int *insn_lengths; > +static char *uid_lock_length; > > VEC(int,heap) *insn_addresses_; > > @@ -447,6 +448,20 @@ get_attr_length (rtx insn) >return get_attr_length_1 (insn, insn_default_length); > } > > +#ifdef HAVE_ATTR_lock_length > +int > +get_attr_lock_length (rtx insn) > +{ > + if (uid_lock_length && insn_lengths_max_uid > INSN_UID (insn)) > +return uid_lock_length[INSN_UID (insn)]; > + return get_attr_length_1 (insn, insn_min_lock_length); > +} > +#define INSN_VARIABLE_LENGTH_P(INSN) \ > + (insn_variable_length_p (INSN) || insn_variable_lock_length_p (INSN)) > +#else > +#define INSN_VARIABLE_LENGTH_P(INSN) (insn_variable_length_p (INSN)) > +#e
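A schematic of the convergence property being discussed, as a plain C sketch (all names invented for illustration): whichever starting point is chosen, termination only needs the per-insn length estimate to be monotone, i.e. once a length has grown it must never be allowed to shrink in a later iteration.

extern int recompute_length (int i, const int *length);  /* hypothetical */

static void
shorten_branches_sketch (int *length, int n_insns)
{
  bool changed = true;
  while (changed)
    {
      changed = false;
      for (int i = 0; i < n_insns; i++)
        {
          int new_len = recompute_length (i, length);
          if (new_len > length[i])   /* grow-only: guarantees a fixed point */
            {
              length[i] = new_len;
              changed = true;
            }
        }
    }
}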
Re: opts.c, gcc.c: Plug some memory leaks - and an out-of-bounds memory access
On Wed, Oct 3, 2012 at 11:01 PM, Tobias Burnus wrote: > Found using http://scan5.coverity.com/ > > Build on x86-64-gnu-linux with C/C++/Fortran. I will now do an all-language > build/regtest. > OK when it passes? > > (Note to the save_string call: I reduced it by 2: The "+1" in the call makes > it long (out of bounds) and the "+1" in temp_filename_length is not needed > (but also doesn't harm) as "tmp" is null terminated and save_string adds > another '\0' after copying "len" bytes.) - prefix = concat (target_sysroot_suffix, prefix, NULL); - prefix = concat (sysroot_no_trailing_dir_separator, prefix, NULL); + { + char *tmp; + tmp = concat (target_sysroot_suffix, prefix, NULL); + prefix = concat (sysroot_no_trailing_dir_separator, tmp, NULL); + free (tmp); + } prefix = concat (sysroot_no_trailing_dir_separator, target_sysroot_suffix, prefix, NULL); should be equivalent and easier to read, no? + else + prefix = concat (sysroot_no_trailing_dir_separator, prefix, NULL); + btw, we're not caring too much about memleaks in the driver ... Otherwise the patch looks ok with the above change. Thanks, Richard. > Tobias
Re: [Ping]FW: [PATCH] Cilk Plus merging to trunk (2 of n)
On Thu, Oct 4, 2012 at 2:22 AM, Iyer, Balaji V wrote: > Hi Joseph, > Did you get a chance to look at this submission? I think I have fixed > all the changes you have mentioned. Is it OK for trunk? > > Thanks, > > Balaji V. Iyer. > >>-Original Message- >>From: gcc-patches-ow...@gcc.gnu.org [mailto:gcc-patches- >>ow...@gcc.gnu.org] On Behalf Of Iyer, Balaji V >>Sent: Wednesday, September 26, 2012 7:16 PM >>To: Joseph Myers >>Cc: gcc-patches@gcc.gnu.org; al...@redhat.com; r...@redhat.com; >>l...@redhat.com >>Subject: RE: [PATCH] Cilk Plus merging to trunk (2 of n) >> >>Hello Joseph, >> In my last patch, I forgot to add the change Richard Guenther wanted me >>to make. He wanted me to move the ARRAY_NOTATION_REF node from tree.def >>to c-family/c-common.def. Here is a new one that has this change. I am sorry >>for >>this. >> >>Here are ChangeLog entries: >> >>gcc/ChangeLog >>2012-09-26 Balaji V. Iyer >> >>* tree.h (array_notation_reduce_type): Added new enumerator. This should be moved to c-tree.h then, and ... >>* Makefile.in (OBJS): Added array-notation-common.o. >>* doc/passes.texi (Cilk Plus Transformation): Documented array >>notation and overall transformations for Cilk Plus. >>* doc/invoke.texi (C Dialect Options): Documented -fcilkplus flag. >>* doc/generic.texi (Storage References): Documented >>ARRAY_NOTATION_REF >>tree addition. >>* tree-pretty-pretty.c (dump_generic_node): Added ARRAY_NOTATION_REF >>case. ... this to c-pretty-print.c and >>* array-notation-common.c: New file. ... this to the c-common/ directory. Basically this should be a completely frontend-only patch. Richard. >>gcc/c-family/ChangeLog >>2012-09-26 Balaji V. Iyer >> >>* c-common.h (build_array_notation_expr): New function declaration. >>(ARRAY_NOTATION_ARRAY): Added new #define. >>(ARRAY_NOTATION_CHECK): Likewise. >>(ARRAY_NOTATION_START): Likewise. >>(ARRAY_NOTATION_LENGTH): Likewise. >>(ARRAY_NOTATION_STRIDE): Likewise. >>(ARRAY_NOTATION_TYPE): Likewise. >>* c-common.def: Added new tree ARRAY_NOTATION_REF. >>* c-common.c (c_define_builtins): Added a call to initialize array >>notation builtin functions. >>(c_common_init_ts): Set ARRAY_NOTATION_REF as typed. >>* c.opt (-fcilkplus): Define new command line switch. >> >>gcc/c/ChangeLog >>2012-09-26 Balaji V. Iyer >> >>* c-typeck.c (convert_arguments): Added a check if tree contains >>array notation expressions before throwing errors or doing anything. >>* Make-lang.in (C_AND_OBJC_OBJS): Added c-array-notation.o. >>* c-parser.c (c_parser_compound_statement): Check if array notation >> code >>is used in tree, if so, then transform them into appropriate C code. >>(c_parser_expr_no_commas): Check if array notation is used in LHS or >>RHS, if so, then build array notation expression instead of regular >>modify. >>(c_parser_postfix_expression_after_primary): Added a check for >> colon(s) >>after square braces, if so then handle it like an array notation. >> Also, >>break up array notations in unary op if found. >>(c_parser_array_notation): New function. >>* c-array-notation.c: New file. >> >>gcc/testsuite/ChangeLog >>2012-09-26 Balaji V. Iyer >> >>* gcc.dg/cilk-plus/array_notation/execute/execute.exp: New script. >>* gcc.dg/cilk-plus/array_notation/compile/compile.exp: Likewise. >>* gcc.dg/cilk-plus/array_notation/errors/errors.exp: Likewise. >>* gcc.dg/cilk-plus/array_notation/execute/sec_implicit_ex.c: New test. >>* gcc.dg/cilk-plus/array_notation/execute/if_test.c: Likewise. >>* gcc.dg/cilk-plus/array_notation/execute/gather_scatter.c: Likewise. 
>>* gcc.dg/cilk-plus/array_notation/execute/builtin_func_double2.c: >>Likewise. >>* gcc.dg/cilk-plus/array_notation/execute/builtin_func_double.c: >>Likewise. >>* gcc.dg/cilk-plus/array_notation/execute/builtin_fn_custom.c: >> Likewise. >>* gcc.dg/cilk-plus/array_notation/execute/builtin_fn_mutating.c: >>Likewise. >>* gcc.dg/cilk-plus/array_notation/execute/array_test_ND.c: Likewise. >>* gcc.dg/cilk-plus/array_notation/execute/array_test2.c: Likewise. >>
Re: Fix -fdump-ada-spec
On Thu, Oct 4, 2012 at 10:26 AM, Arnaud Charlet wrote: > After changes by Sharad (Add option for dumping to stderr (issue6190057)), > -fdump-ada-spec is broken, and is now a no-op. > > Admittedly, this is because -fdump-ada-spec is handled differently from > other -fdump-* switches, so this patch fixes support for -fdump-ada-spec > by using an approach similar to -fdump-go-spec, and use regular switches > via c/c.opt. I've removed the handling of TDF_RAW, which was a debugging > option, and never really used, so can be simply deleted. > > Change is mostly trivial/mechanical. > > Tested on x86_64-pc-linux-gnu, OK for trunk? Much cleaner indeed. Ok, Thanks, Richard. > gcc/ > > 2012-10-04 Arnaud Charlet > > * dumpfile.h, dumpfile.c: Remove TDI_ada. > > c-family/ > > 2012-10-04 Arnaud Charlet > > * c-ada-spec.c (print_ada_declaration): Remove handling of TDF_RAW. > * c.opt (-fdump-ada-spec, -fdump-ada-spec-slim): Move switch > definition > out of dumpfile.h. > > c/ > > 2012-10-04 Arnaud Charlet > > * c-decl.c (c_write_global_declarations): Fix handling of > -fdump-ada-spec*. > > cp/ > > 2012-10-04 Arnaud Charlet > > * decl2.c (cp_write_global_declarations): Fix handling of > -fdump-ada-spec*. > > Arno > -- > Index: c-family/c.opt > === > --- c-family/c.opt (revision 192062) > +++ c-family/c.opt (working copy) > @@ -799,6 +799,14 @@ fdollars-in-identifiers > C ObjC C++ ObjC++ > Permit '$' as an identifier character > > +fdump-ada-spec > +C ObjC C++ ObjC++ RejectNegative Var(flag_dump_ada_spec) > +Write all declarations as Ada code transitively > + > +fdump-ada-spec-slim > +C ObjC C++ ObjC++ RejectNegative Var(flag_dump_ada_spec_slim) > +Write all declarations as Ada code for the given file only > + > felide-constructors > C++ ObjC++ Var(flag_elide_constructors) Init(1) > > Index: c-family/c-ada-spec.c > === > --- c-family/c-ada-spec.c (revision 192062) > +++ c-family/c-ada-spec.c (working copy) > @@ -2535,7 +2535,6 @@ print_ada_declaration (pretty_printer *b >int is_class = false; >tree name = TYPE_NAME (TREE_TYPE (t)); >tree decl_name = DECL_NAME (t); > - bool dump_internal = get_dump_file_info (TDI_ada)->pflags & TDF_RAW; >tree orig = NULL_TREE; > >if (cpp_check && cpp_check (t, IS_TEMPLATE)) > @@ -2705,8 +2704,7 @@ print_ada_declaration (pretty_printer *b > } >else > { > - if (!dump_internal > - && TREE_CODE (t) == VAR_DECL > + if (TREE_CODE (t) == VAR_DECL > && decl_name > && *IDENTIFIER_POINTER (decl_name) == '_') > return 0; > @@ -2796,8 +2794,7 @@ print_ada_declaration (pretty_printer *b > >/* If this function has an entry in the dispatch table, we cannot > omit it. */ > - if (!dump_internal && !DECL_VINDEX (t) > - && *IDENTIFIER_POINTER (decl_name) == '_') > + if (!DECL_VINDEX (t) && *IDENTIFIER_POINTER (decl_name) == '_') > { > if (IDENTIFIER_POINTER (decl_name)[1] == '_') > return 0; > Index: c/c-decl.c > === > --- c/c-decl.c (revision 192062) > +++ c/c-decl.c (working copy) > @@ -10079,10 +10079,10 @@ c_write_global_declarations (void) >gcc_assert (!current_scope); > >/* Handle -fdump-ada-spec[-slim]. 
*/ > - if (dump_initialized_p (TDI_ada)) > + if (flag_dump_ada_spec || flag_dump_ada_spec_slim) > { >/* Build a table of files to generate specs for */ > - if (get_dump_file_info (TDI_ada)->pflags & TDF_SLIM) > + if (flag_dump_ada_spec_slim) > collect_source_ref (main_input_filename); >else > for_each_global_decl (collect_source_ref_cb); > Index: cp/decl2.c > === > --- cp/decl2.c (revision 192062) > +++ cp/decl2.c (working copy) > @@ -3698,9 +3698,9 @@ cp_write_global_declarations (void) >cgraph_process_same_body_aliases (); > >/* Handle -fdump-ada-spec[-slim] */ > - if (dump_initialized_p (TDI_ada)) > + if (flag_dump_ada_spec || flag_dump_ada_spec_slim) > { > - if (get_dump_file_info (TDI_ada)->pflags & TDF_SLIM) > + if (flag_dump_ada_spec_slim) > collect_source_ref (main_input_filename); >else > collect_source_refs (global_namespace); > Index: dumpfile.c > === > --- dumpfile.c (revision 192062) > +++ dumpfile.c (working copy) > @@ -57,8 +57,7 @@ static struct dump_file_info dump_files[ > 0, 0, 0, 5}, >{".vcg", "tree-vcg", NULL, NULL, NULL, NULL, NULL, TDF_TREE, > 0, 0, 0, 6}, > - {".ads", "ada-spec", NULL, NULL, NULL, NULL, NULL, 0, 0, 0, 0, 7}, > -#define FIRST_AUTO_NUMBERED_DUMP 8 > +#define FIRST_AUTO_NUMBERED_DU
Re: [patch][lra] Improve initial program point density in lra-lives.c (was: Re: RFC: LRA for x86/x86-64 [7/9])
On Thu, Oct 4, 2012 at 11:43 AM, Steven Bosscher wrote: > On Wed, Oct 3, 2012 at 5:35 PM, Steven Bosscher wrote: >> The "worst" result is this: >> Compressing live ranges: from 726174 to 64496 - 8%, pre_count 40476128, >> post_count 12483414 >> >> But that's still a lot better than before the patch for the same function: >> Compressing live ranges: from 1742569 to 73069 - 4%, pre_count 40842330, >> post_count 12479992 > > Walking basic blocks with FOR_EACH_BB_REVERSE gives: > > Only FOR_EACH_BB_REVERSE: > Compressing live ranges: from 1742579 to 429746 - 24% pre_count > 41106212, post_count 34376494 > Compressing live ranges: from 1742569 to 63000 - 3% pre_count > 40835340, post_count 11055747 > > FOR_EACH_BB_REVERSE + need_curr_point_incr: > Compressing live ranges: from 726184 to 416529 - 57% pre_count > 40743516, post_count 34376846 > Compressing live ranges: from 726174 to 61840 - 8% pre_count 40472806, > post_count 11055747 > > The combination of the two changes takes ~20s off the ~180s for "LRA > create live ranges". Isn't _REVERSE vs. non-_REVERSE still kind-of "random" order? Thus, doesn't the above show there exists an optimal order for processing which we could use? (I realize _REVERSE is a simple solution, but might there not exist a pathological case where _REVERSE is even worse than non-_REVERSE?) Richard. > Ciao! > Steven
Re: patch to fix
On Wed, Oct 3, 2012 at 7:15 PM, Kenneth Zadeck wrote: > The enclosed patch is the third of at least four patches that fix the > problems associated with supporting integers on the target that are > wider than two HOST_WIDE_INTs. > > While GCC claims to support OI mode, and we have two public ports that > make minor use of this mode, in practice, compilation that uses OImode > mode commonly gives the wrong result or ices. We have a private port > of GCC for an architecture that is further down the road to needing > comprehensive OImode and we have discovered that this is unusable. We > have decided to fix it in a general way that so that it is most > beneficial to the GCC community. It is our belief that we are just a > little ahead of the X86 and the NEON and these patches will shortly be > essential. > > The first two of these patches were primarily lexigraphical and have > already been committed.They transformed the uses of CONST_DOUBLE > so that it is easy to tell what the intended usage is. > > The underlying structures in the next two patches are very general: > once they are added to the compiler, the compiler will be able to > support targets with any size of integer from hosts of any size > integer. > > The patch enclosed deals with the portable RTL parts of the compiler. > The next patch, which is currently under construction deals with the > tree level. However, this patch can be put on the trunk as is, and it > will eleviate many, but not all of the current limitations in the rtl > parts of the compiler. > > Some of the patch is conditional, depending on a port defining the > symbol 'TARGET_SUPPORTS_WIDE_INT' to be non zero. Defining this > symbol to be non zero is declaring that the port has been converted to > use the new form or integer constants. However, the patch is > completely backwards compatible to allow ports that do not need this > immediately to convert at their leasure. The conversion process is > not difficult, but it does require some knowledge of the port, so we > are not volinteering to do this for all ports. > > OVERVIEW OF THE PATCH: > > The patch defines a new datatype, a 'wide_int' (defined in > wide-int.[ch], and this datatype will be used to perform all of the > integer constant math in the compiler. Externally, wide-int is very > similar to double-int except that it does not have the limitation that > math must be done on exactly two HOST_WIDE_INTs. > > Internally, a wide_int is a structure that contains a fixed sized > array of HOST_WIDE_INTs, a length field and a mode. The size of the That it has a mode sounds odd to me and makes it subtly different from HOST_WIDE_INT and double-int. Maybe the patch will tell why this is so. > array is determined at generation time by dividing the number of bits > of the largest integer supported on the target by the number of bits > in a HOST_WIDE_INT of the host. Thus, with this format, any length of > integer can be supported on any host. > > A new rtx type is created, the CONST_WIDE_INT, which contains a > garbage collected array of HOST_WIDE_INTS that is large enough to hold > the constant. For the targets that define TARGET_SUPPORTS_WIDE_INT to > be non zero, CONST_DOUBLES are only used to hold floating point > values. If the target leaves TARGET_SUPPORTS_WIDE_INT defined as 0, > CONST_WIDE_INTs are not used and CONST_DOUBLEs are as they were > before. > > CONST_INT does not change except that it is defined to hold all > constants that fit in exactly one HOST_WIDE_INT. 
Note that is slightly > different than the current trunk. Before this patch, the TImode > constant '5' could either be in a CONST_INT or CONST_DOUBLE depending > on which code path was used to create it. This patch changes this so > that if the constant fits in a CONST_INT then it is represented in a > CONST_INT no matter how it is created. > > For the array inside a CONST_WIDE_INT, and internally in wide-int, we > use a compressed form for integers that need more than one > HOST_WIDE_INT. Higher elements of the array are not needed if they > are just a sign extension of the elements below them. This does not > imply that constants are signed or are sign extended, this is only a > compression technique. > > While it might seem to be more esthetically pleasing to have not > introduced the CONST_WIDE_INT and to have changed the representation > of the CONST_INT to accomodate larger numbers, this would have both > used more space and would be a time consuming change for the port > maintainers. We believe that most ports can be quickly converted with > the current scheme because there is just not a lot of code in the back > ends that cares about large constants. Furthermore, the CONST_INT is > very space efficient and even in a program that was heavy in large > values, most constants would still fit in a CONST_INT. > > All of the parts of the rtl level that deal with CONST_DOUBLE as an > now conditionally work with CONST_WIDE_INTs depending on
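To illustrate the compression rule described above, here is a small self-contained sketch (illustrative only; the names are made up and int64_t stands in for HOST_WIDE_INT, whose width is host dependent): higher elements of the array are dropped when they are merely the sign extension of the element below them.

#include <cstdint>
#include <cstddef>

/* Return how many elements of VAL (currently LEN elements long) are really
   needed: an element is redundant if it is just the sign extension of the
   element below it.  */
static size_t
canonical_len (const int64_t *val, size_t len)
{
  while (len > 1)
    {
      int64_t ext = val[len - 2] < 0 ? -1 : 0;  /* sign extension of the element below */
      if (val[len - 1] != ext)
        break;
      --len;
    }
  return len;
}

Under this rule the TImode constant 5 and the TImode constant -1 both need only one element even though the mode is 128 bits wide; as the description stresses, this is only a compression technique and implies nothing about signedness.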
Re: [Patch] Fix PR53397
On Tue, Oct 2, 2012 at 6:40 PM, Kumar, Venkataramanan wrote: > Hi Richi, > > (Snip) >> + (!cst_and_fits_in_hwi (step)) >> +{ >> + if( loop->inner != NULL) >> +{ >> + if (dump_file && (dump_flags & TDF_DETAILS)) >> +{ >> + fprintf (dump_file, "Reference %p:\n", (void *) ref); >> + fprintf (dump_file, "(base " ); >> + print_generic_expr (dump_file, base, TDF_SLIM); >> + fprintf (dump_file, ", step "); >> + print_generic_expr (dump_file, step, TDF_TREE); >> + fprintf (dump_file, ")\n"); > > No need to repeat this - all references are dumped when we gather them. > (Snip) > > The dumping happens at "record_ref" which is called after these statements to > record these references. > > When the step is invariant we return from the function without recording the > references. > > so I thought of dumping the references here. > > Is there a cleaner way to dump the references at one place? Yes, call dump_mem_ref then, instead of repeating parts of its body. Richard. > Regards, > Venkat. > > > > -Original Message- > From: Richard Guenther [mailto:rguent...@suse.de] > Sent: Tuesday, October 02, 2012 5:42 PM > To: Kumar, Venkataramanan > Cc: gcc-patches@gcc.gnu.org > Subject: Re: [Patch] Fix PR53397 > > On Mon, 1 Oct 2012, venkataramanan.ku...@amd.com wrote: > >> Hi, >> >> The below patch fixes the FFT/Scimark regression caused by useless >> prefetch generation. >> >> This fix tries to make prefetch less aggressive by prefetching arrays >> in the inner loop, when the step is invariant in the entire loop nest. >> >> GCC currently tries to prefetch invariant steps when they are in the >> inner loop. But does not check if the step is variant in outer loops. >> >> In the scimark FFT case, the trip count of the inner loop varies by a >> non constant step, which is invariant in the inner loop. >> But the step variable is varying in outer loop. This makes inner loop >> trip count small (at run time varies sometimes as small as 1 >> iteration) >> >> Prefetching ahead x iteration when the inner loop trip count is >> smaller than x leads to useless prefetches. >> >> Flag used: -O3 -march=amdfam10 >> >> Before >> ** ** >> ** SciMark2 Numeric Benchmark, see http://math.nist.gov/scimark ** >> ** for details. (Results can be submitted to p...@nist.gov) ** >> ** ** >> Using 2.00 seconds min time per kenel. >> Composite Score: 550.50 >> FFT Mflops:38.66(N=1024) >> SOR Mflops: 617.61(100 x 100) >> MonteCarlo: Mflops: 173.74 >> Sparse matmult Mflops: 675.63(N=1000, nz=5000) >> LU Mflops: 1246.88(M=100, N=100) >> >> >> After >> ** ** >> ** SciMark2 Numeric Benchmark, see http://math.nist.gov/scimark ** >> ** for details. (Results can be submitted to p...@nist.gov) ** >> ** ** >> Using 2.00 seconds min time per kenel. >> Composite Score: 639.20 >> FFT Mflops: 479.19(N=1024) >> SOR Mflops: 617.61(100 x 100) >> MonteCarlo: Mflops: 173.18 >> Sparse matmult Mflops: 679.13(N=1000, nz=5000) >> LU Mflops: 1246.88(M=100, N=100) >> >> GCC regression "make check -k" passes with x86_64-unknown-linux-gnu >> New tests that PASS: >> >> gcc.dg/pr53397-1.c scan-assembler prefetcht0 gcc.dg/pr53397-1.c >> scan-tree-dump aprefetch "Issued prefetch" >> gcc.dg/pr53397-1.c (test for excess errors) gcc.dg/pr53397-2.c >> scan-tree-dump aprefetch "loop variant step" >> gcc.dg/pr53397-2.c scan-tree-dump aprefetch "Not prefetching" >> gcc.dg/pr53397-2.c (test for excess errors) >> >> >> Checked CPU2006 and polyhedron on latest AMD processor, no regressions noted. >> >> Ok to commit in trunk? 
>> >> regards, >> Venkat >> >> gcc/ChangeLog >> +2012-10-01 Venkataramanan Kumar >> + >> + * tree-ssa-loop-prefetch.c (gather_memory_references_ref): >> + Perform non constant step prefetching in inner loop, only >> + when it is inva
Re: [RFC] Make vectorizer to skip loops with small iteration estimate
On Thu, 4 Oct 2012, Jan Hubicka wrote: > > > So SOC cancels out in the runtime check. > > > I still think we need two formulas - one determining if vectorization is > > > profitable, other specifying the threshold for scalar path at runtime > > > (that > > > will generally give lower values). > > > > True, we want two values. But part of the scalar path right now > > is all the computation required for alias and alignment runtime checks > > (because the way all the conditions are combined). > > > > I'm not much into the details of what we account for in SOC (I suppose > > it's everything we insert in the preheader of the vector loop). > > Yes, it seems contain everything we insert prior the loop in unfolded form. > > > > + if (vect_print_dump_info (REPORT_UNVECTORIZED_LOCATIONS)) > > +fprintf (vect_dump, "not vectorized: estimated iteration count > > too small."); > > + if (vect_print_dump_info (REPORT_DETAILS)) > > +fprintf (vect_dump, "not vectorized: estimated iteration count > > smaller than " > > + "user specified loop bound parameter or minimum " > > + "profitable iterations (whichever is more > > conservative)."); > > > > this won't work anymore btw - dumping infrastructure changed. > > Ah, will update that. > > > > I suppose your patch is a step in the right direction, but to really > > make progress we need to re-organize the loop and predicate structure > > produced by the vectorizer. > > This reminds me what I did for string functions on x86. It gets very hard > to get all the paths right when one starts to be really cureful to not > output too much cruft on the short paths + do not consume too many registers. > > In fact I want to re-think this for the SSE string ops patch, so I may try to > look into that incrementally. > > > > So, please update your patch, re-test and then it's ok. > > Thanks. > > > I tested enabling loop_ch in early passes with -fprofile-feedback and it > > > is SPEC > > > neutral. Given that it improves loop count estimates, I would still like > > > mainline > > > doing that. I do not like these quite important estimates to be wrong > > > most of time. > > > > I agree. It also helps getting rid of once rolling loops I think. > > I am attaching the patch for early-ch. Will commit it tomorrow. > > Concerning jump threading, it would help to make some of it during early > passes > so the profile estiamte do not get invalided. I tried to move VRP early but > now it > makes compiler to hang during bootstrap. I will debug that. > > > > > > > > > > Btw, I added a "similar" check in vect_analyze_loop_operations: > > > > > > > > if ((LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo) > > > >&& (LOOP_VINFO_INT_NITERS (loop_vinfo) < vectorization_factor)) > > > > || ((max_niter = max_stmt_executions_int (loop)) != -1 > > > > && (unsigned HOST_WIDE_INT) max_niter < vectorization_factor)) > > > > { > > > > if (dump_kind_p (MSG_MISSED_OPTIMIZATION)) > > > > dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location, > > > > "not vectorized: iteration count too small."); > > > > if (dump_kind_p (MSG_MISSED_OPTIMIZATION)) > > > > dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location, > > > > "not vectorized: iteration count smaller than " > > > > "vectorization factor."); > > > > return false; > > > > } > > > > > > > > maybe you simply need to update that to also consider the profile? > > > > > > Hmm, I am still getting familiar wth the code. 
Later we later have > > > if (LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo) > > > && LOOP_VINFO_INT_NITERS (loop_vinfo) <= th) > > > { > > > if (vect_print_dump_info (REPORT_UNVECTORIZED_LOCATIONS)) > > > fprintf (vect_dump, "not vectorized: vectorization not " > > > "profitable."); > > > if (vect_print_dump_info (REPORT_DETAILS)) > > > fprintf (vect_dump, "not vectorized: iteration count smaller than > > > " > > > "user specified loop bound parameter or minimum " > > > "profitable iterations (whichever is more > > > conservative)."); > > > return false; > > > } > > > > > > where th is always greater or equal than vectorization_factor from the > > > cost model. > > > So this test seems redundant if the max_stmt_executions_int was pushed > > > down > > > to the second conditoinal? > > > > Yes, sort of. The new check was supposed to be crystal clear, and > > even with the cost model disabled we want to not vectorize in this > > case. But yes, the whole cost-model stuff needs TLC. > > Ah yes, without cost model we would skip it. I suppose we do not need to > brother witht he profile estiamte in the case anyway. They are kind of aprt > of > the cost models. > > * passes.c (init_optimization_passes): Schedule early CH. >
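As a purely illustrative aid (not vectorizer output; the threshold value and the pointer check are stand-ins), the versioning structure being discussed has roughly this shape, with the runtime profitability threshold and the alias/alignment checks combined into a single guard, which is why the scalar path currently also pays for computing the checks:

/* Illustrative shape of a versioned loop.  */
void
axpy (float *x, float *y, int n, float a)
{
  const int runtime_threshold = 16;              /* scalar-vs-vector cutoff, illustrative */
  int checks_ok = (x + n <= y) || (y + n <= x);  /* stand-in for alias/alignment checks */

  if (n >= runtime_threshold && checks_ok)
    {
      /* vector body would go here, plus an epilogue for the n % VF leftover
         iterations */
      for (int i = 0; i < n; i++)
        y[i] += a * x[i];
    }
  else
    {
      /* scalar fallback, taken both when n is small and when the checks fail */
      for (int i = 0; i < n; i++)
        y[i] += a * x[i];
    }
}

The point of keeping two formulas is that the cutoff used in this runtime guard can legitimately be lower than the iteration count below which vectorization is rejected outright at compile time.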
Re: patch to fix constant math
hile he has carefully checked the rtl stuff, > i think that the code inside wide-int is not in his comfort zone of things > he would approve. > > As far as your btw - noticed this last night. it is an artifact of the way > i produced the patch and "responsible people have been sacked". However, > it shows that you read the patch carefully, and i really appreciate that. > i owe you a beer (not that you need another at this time of year). You also didn't mention the missing tree bits ... was this just a 1/n patch or is it at all usable for you in this state? Where do the large integers magically come from? Richard. > Kenny > > > > On 10/04/2012 08:48 AM, Richard Guenther wrote: >> >> On Wed, Oct 3, 2012 at 7:15 PM, Kenneth Zadeck >> wrote: >>> >>> The enclosed patch is the third of at least four patches that fix the >>> problems associated with supporting integers on the target that are >>> wider than two HOST_WIDE_INTs. >>> >>> While GCC claims to support OI mode, and we have two public ports that >>> make minor use of this mode, in practice, compilation that uses OImode >>> mode commonly gives the wrong result or ices. We have a private port >>> of GCC for an architecture that is further down the road to needing >>> comprehensive OImode and we have discovered that this is unusable. We >>> have decided to fix it in a general way that so that it is most >>> beneficial to the GCC community. It is our belief that we are just a >>> little ahead of the X86 and the NEON and these patches will shortly be >>> essential. >>> >>> The first two of these patches were primarily lexigraphical and have >>> already been committed.They transformed the uses of CONST_DOUBLE >>> so that it is easy to tell what the intended usage is. >>> >>> The underlying structures in the next two patches are very general: >>> once they are added to the compiler, the compiler will be able to >>> support targets with any size of integer from hosts of any size >>> integer. >>> >>> The patch enclosed deals with the portable RTL parts of the compiler. >>> The next patch, which is currently under construction deals with the >>> tree level. However, this patch can be put on the trunk as is, and it >>> will eleviate many, but not all of the current limitations in the rtl >>> parts of the compiler. >>> >>> Some of the patch is conditional, depending on a port defining the >>> symbol 'TARGET_SUPPORTS_WIDE_INT' to be non zero. Defining this >>> symbol to be non zero is declaring that the port has been converted to >>> use the new form or integer constants. However, the patch is >>> completely backwards compatible to allow ports that do not need this >>> immediately to convert at their leasure. The conversion process is >>> not difficult, but it does require some knowledge of the port, so we >>> are not volinteering to do this for all ports. >>> >>> OVERVIEW OF THE PATCH: >>> >>> The patch defines a new datatype, a 'wide_int' (defined in >>> wide-int.[ch], and this datatype will be used to perform all of the >>> integer constant math in the compiler. Externally, wide-int is very >>> similar to double-int except that it does not have the limitation that >>> math must be done on exactly two HOST_WIDE_INTs. >>> >>> Internally, a wide_int is a structure that contains a fixed sized >>> array of HOST_WIDE_INTs, a length field and a mode. The size of the >> >> That it has a mode sounds odd to me and makes it subtly different >> from HOST_WIDE_INT and double-int. Maybe the patch will tell >> why this is so. 
>> >>> array is determined at generation time by dividing the number of bits >>> of the largest integer supported on the target by the number of bits >>> in a HOST_WIDE_INT of the host. Thus, with this format, any length of >>> integer can be supported on any host. >>> >>> A new rtx type is created, the CONST_WIDE_INT, which contains a >>> garbage collected array of HOST_WIDE_INTS that is large enough to hold >>> the constant. For the targets that define TARGET_SUPPORTS_WIDE_INT to >>> be non zero, CONST_DOUBLES are only used to hold floating point >>> values. If the target leaves TARGET_SUPPORTS_WIDE_INT defined as 0, >>> CONST_WIDE_INTs are not used and CONST_DOUBLEs are as they were >>> before. >>> >>> CONST_INT does not change except that it is defined
Re: [lra] patch to solve most scalability problems for LRA
On Thu, Oct 4, 2012 at 7:07 PM, Steven Bosscher wrote: > On Thu, Oct 4, 2012 at 5:37 PM, Vladimir Makarov wrote: >> The only issue now is PR54146 compilation time for IRA+LRA although it >> was improved significantly. I will continue work on PR54146. But now I >> am going to focus on proposals from reviews. > > Right, there still are opportunities to improve things. > > (The real solution may be to stop SRA from creating so many > simultaneously live pseudos in the first place...) > >> + lra_simple_p >> += (ira_use_lra_p && max_reg_num () >= (1 << 26) / last_basic_block); > > I think you should use n_basic_blocks here instead of > last_basic_block, in case this runs without compacting the cfg first > (n_basic_blocks is the real number of basic blocks in the cfg, > last_basic_block is the highest index, so last_basic_block >= > n_basic_blocks). I also noticed that switching to IRA_REGION_ONE improves things when we have a large number of loops (profile points to some loop code in IRA). Note that the magic number above should be a new --param, and once we have a diagnostic flag that shows whenever we back off like this it should notify the user of that fact (and the params we have overflown) - this just reminded me of that idea from somebody else ;) > Thanks for working on this! Indeed ;) It, btw, also applies to IRA + reload ... Richard. > Ciao! > Steven
Re: PATCH trunk: gengtype honoring mark_hook-s inside struct inside union-s
On Thu, Oct 4, 2012 at 7:24 PM, Basile Starynkevitch wrote: > On Thu, Oct 04, 2012 at 06:51:35PM +0300, Laurynas Biveinis wrote: >> > 2012-10-03 Basile Starynkevitch >> > >> > * gengtype.c (walk_type): Emit mark_hook when inside a >> > struct of a union member. >> >> This is OK. > > thanks, Committed revision 192092 to trunk. > > > I believe this patch should be backported into GCC 4.7 and 4.6 I see no reason for this unless it is a regression. Richard. > Regards. > -- > Basile STARYNKEVITCH http://starynkevitch.net/Basile/ > email: basilestarynkevitchnet mobile: +33 6 8501 2359 > 8, rue de la Faiencerie, 92340 Bourg La Reine, France > *** opinions {are only mines, sont seulement les miennes} ***
Re: [PATCH] Teach VRP to handle if ((unsigned_narrowing_cast) x != 0) similarly to if ((x & 0xffff) != 0) (PR tree-optimization/54810)
On Thu, Oct 4, 2012 at 6:31 PM, Jakub Jelinek wrote: > Hi! > > This patch handles unsigned narrowing casts the same as > BIT_AND_EXPR with the unsigned narrow type's max value. > > Bootstrapped/regtested on x86_64-linux and i686-linux, > ok for trunk? Ok. Thanks, Richard. > 2012-10-04 Jakub Jelinek > > PR tree-optimization/54810 > * tree-vrp.c (register_edge_assert_for_2): Handle > NAME = (unsigned) NAME2; if (NAME cmp CST) for > narrowing casts to unsigned integral type like > NAME = NAME2 & CST2; if (NAME cmp CST) where CST2 > is the max value of the unsigned integral type. > > --- gcc/tree-vrp.c.jj 2012-09-25 14:45:48.0 +0200 > +++ gcc/tree-vrp.c 2012-10-04 11:43:32.334988401 +0200 > @@ -4712,6 +4712,11 @@ register_edge_assert_for_2 (tree name, e >tree val2 = NULL_TREE; >double_int mask = double_int_zero; >unsigned int prec = TYPE_PRECISION (TREE_TYPE (val)); > + unsigned int nprec = prec; > + enum tree_code rhs_code = ERROR_MARK; > + > + if (is_gimple_assign (def_stmt)) > + rhs_code = gimple_assign_rhs_code (def_stmt); > >/* Add asserts for NAME cmp CST and NAME being defined > as NAME = (int) NAME2. */ > @@ -4721,7 +4726,7 @@ register_edge_assert_for_2 (tree name, e > && gimple_assign_cast_p (def_stmt)) > { > name2 = gimple_assign_rhs1 (def_stmt); > - if (CONVERT_EXPR_CODE_P (gimple_assign_rhs_code (def_stmt)) > + if (CONVERT_EXPR_CODE_P (rhs_code) > && INTEGRAL_TYPE_P (TREE_TYPE (name2)) > && TYPE_UNSIGNED (TREE_TYPE (name2)) > && prec == TYPE_PRECISION (TREE_TYPE (name2)) > @@ -4767,8 +4772,7 @@ register_edge_assert_for_2 (tree name, e > NAME = NAME2 >> CST2. > > Extract CST2 from the right shift. */ > - if (is_gimple_assign (def_stmt) > - && gimple_assign_rhs_code (def_stmt) == RSHIFT_EXPR) > + if (rhs_code == RSHIFT_EXPR) > { > name2 = gimple_assign_rhs1 (def_stmt); > cst2 = gimple_assign_rhs2 (def_stmt); > @@ -4840,21 +4844,37 @@ register_edge_assert_for_2 (tree name, e >/* Add asserts for NAME cmp CST and NAME being defined as > NAME = NAME2 & CST2. > > -Extract CST2 from the and. */ > +Extract CST2 from the and. > + > +Also handle > +NAME = (unsigned) NAME2; > +casts where NAME's type is unsigned and has smaller precision > +than NAME2's type as if it was NAME = NAME2 & MASK. 
*/ >names[0] = NULL_TREE; >names[1] = NULL_TREE; >cst2 = NULL_TREE; > - if (is_gimple_assign (def_stmt) > - && gimple_assign_rhs_code (def_stmt) == BIT_AND_EXPR) > + if (rhs_code == BIT_AND_EXPR > + || (CONVERT_EXPR_CODE_P (rhs_code) > + && TREE_CODE (TREE_TYPE (val)) == INTEGER_TYPE > + && TYPE_UNSIGNED (TREE_TYPE (val)) > + && TYPE_PRECISION (TREE_TYPE (gimple_assign_rhs1 (def_stmt))) > +> prec > + && !retval)) > { > name2 = gimple_assign_rhs1 (def_stmt); > - cst2 = gimple_assign_rhs2 (def_stmt); > + if (rhs_code == BIT_AND_EXPR) > + cst2 = gimple_assign_rhs2 (def_stmt); > + else > + { > + cst2 = TYPE_MAX_VALUE (TREE_TYPE (val)); > + nprec = TYPE_PRECISION (TREE_TYPE (name2)); > + } > if (TREE_CODE (name2) == SSA_NAME > && INTEGRAL_TYPE_P (TREE_TYPE (name2)) > && TREE_CODE (cst2) == INTEGER_CST > && !integer_zerop (cst2) > - && prec <= HOST_BITS_PER_DOUBLE_INT > - && (prec > 1 > + && nprec <= HOST_BITS_PER_DOUBLE_INT > + && (nprec > 1 > || TYPE_UNSIGNED (TREE_TYPE (val > { > gimple def_stmt2 = SSA_NAME_DEF_STMT (name2); > @@ -4881,12 +4901,12 @@ register_edge_assert_for_2 (tree name, e > bool valid_p = false, valn = false, cst2n = false; > enum tree_code ccode = comp_code; > > - valv = tree_to_double_int (val).zext (prec); > - cst2v = tree_to_double_int (cst2).zext (prec); > + valv = tree_to_double_int (val).zext (nprec); > + cst2v = tree_to_double_int (cst2).zext (nprec); > if (!TYPE_UNSIGNED (TREE_TYPE (val))) > { > - valn = valv.sext (prec).is_negative (); > - cst2n = cst2v.sext (prec).is_negative (); > + valn = valv.sext (nprec).is_negative (); > + cst2n = cst2v.sext (nprec).is_negative (); > } > /* If CST2 doesn't have most significant bit set, > but VAL is negative, we have comparison like > @@ -4894,7 +4914,7 @@ register_edge_assert_for_2 (tree name, e > if (!cst2n && valn) > ccode = ERROR_MARK; > if (cst2n) > - sgnbit = do
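For reference, a minimal example (assuming the usual 32-bit int and 16-bit unsigned short; the function name is made up) of the equivalence the patch teaches VRP:

int
f (unsigned int x)
{
  unsigned short t = (unsigned short) x;  /* NAME = (unsigned) NAME2 */
  if (t != 0)                             /* handled like if ((x & 0xffff) != 0) */
    return 1;                             /* on this edge VRP can assert that the low
                                             16 bits of x are not all zero */
  return 0;
}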
Re: Convert more non-GTY htab_t to hash_table.
On Thu, 4 Oct 2012, Lawrence Crowl wrote: > On 10/4/12, Richard Guenther wrote: > > On Tue, 2 Oct 2012, Lawrence Crowl wrote: > >> On 10/2/12, Richard Guenther wrote: > >> > On Mon, 1 Oct 2012, Lawrence Crowl wrote: > >> > > Change more non-GTY hash tables to use the new type-safe > >> > > template hash table. Constify member function parameters that > >> > > can be const. Correct a couple of expressions in formerly > >> > > uninstantiated templates. > >> > > > >> > > The new code is 0.362% faster in bootstrap, with a 99.5% > >> > > confidence of being faster. > >> > > > >> > > Tested on x86-64. > >> > > > >> > > Okay for trunk? > >> > > >> > You are changing a hashtable used by fold checking, did you test > >> > with fold checking enabled? > >> > >> I didn't know I had to do anything beyond the normal make check. > >> What do I do? > >> > >> > +/* Data structures used to maintain mapping between basic blocks and > >> > + copies. */ > >> > +static hash_table bb_original; > >> > +static hash_table bb_copy; > >> > > >> > note that because hash_table has a constructor we now get global > >> > CTORs for all statics :( (and mx-protected local inits ...) > >> > >> The overhead for the global constructors isn't significant. > >> Only the function-local statics have mx-protection, and that can > >> be eliminated by making them global static. > >> > >> > Can you please try to remove the constructor from hash_table to > >> > avoid this overhead? (as a followup - that is, don't initialize > >> > htab) > >> > >> The initialization avoids potential errors in calling dispose. > >> I can do it, but I don't think the overhead (after moving the > >> function-local statics to global) will matter, and so I prefer to > >> keep the safety. So is the move of the statics sufficient or do > >> you still want to remove constructors? > > > > Hm, having them in-scope where they are used is good style. > > Why can't they be statically initialized and put in .data? > > Please make it so - you know C++ enough (ISTR value-initialization > > is default - which means NULL for the pointer?) > > Zero initialization is default for static variables, but not for > local or heap variables. We can live with the uninitialized memory Not for local static variables? I mean, isn't it possible to have an initializer like foo () { static X x = X(); that behaves in the way that it is "inlined", that is, implemented as putting x into .data, with proper initial contents? Then, why isn't there a way that X is default (aka zero) initialized when there is no initializer? I simply want C behavior back here ... and I would have expected that not providing the hash_table::hash_table() constructor would just do that (with the complication that new-style allocation will end up with an uninitialized object, which we could avoid with providing a operator new implementation?). > in some cases, and add another function to explicitly null the > member in the rest of the cases. I am not convinced that extra > coding is worth the performance difference, particularly as I do > not expect that difference to be measureable. > > However we decide here, I think that work should be a separate patch, > as it will certainly touch more files than the current patch. So, > can we separate the issue? Sure, see my initial request. Thanks, Richard.
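A small standalone illustration of the trade-off being discussed (the types here are made up, not the real hash_table): a namespace-scope static of a type without user-provided constructors is zero-initialized and needs no runtime construction, whereas giving the type a constructor conceptually requires dynamic initialization via a global constructor, unless the compiler manages to fold it back into static initialization.

struct plain_table { void **entries; unsigned size; };  /* no constructor */
static plain_table tbl1;     /* zero-initialized, lives in .bss, no global ctor */

struct ctor_table
{
  ctor_table () : entries (0), size (0) {}
  void **entries;
  unsigned size;
};
static ctor_table tbl2;      /* conceptually needs a dynamic (global ctor) init */

void **
lookup1 ()
{
  return tbl1.entries;       /* safe even before any constructors run: it is a
                                null pointer, much as a plain htab_t would be */
}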
Re: RFC: C++ PATCH to support dynamic initialization and destruction of C++11 and OpenMP TLS variables
On Thu, Oct 4, 2012 at 7:38 PM, Jason Merrill wrote: > Both C++11 and OpenMP specify that thread_local/threadprivate variables can > have dynamic initialization and destruction semantics. This sequence of > patches implements that. > > The first patch adds the C++11 thread_local keyword and implements the C++11 > parsing rules for thread_local, which differ somewhat from __thread. It > also allows dynamic initialization of function-local thread_local variables. > > The second patch adds a __cxa_thread_atexit interface for registering > cleanups to be run when a thread exits. The current implementation is not > fully conformant; on exit, TLS destructors are supposed to run before any > destructors for non-TLS variables, but I think that will require glibc > support for __cxa_thread_atexit. > > The third patch implements dynamic initialization of non-function-local TLS > variables using the init-on-first-use idiom: uses of such a variable are > replaced with calls to a wrapper function, so that > > int f(); > thread_local int i = f(); > > implies > > void i_init() { > static bool initialized; > if (!initialized) > { > initialized = true; > i = f(); > } > } > inline int& i_wrapper() { > i_init(); > return i; > } > > Note that if we see > > extern thread_local int i; > > in a header, we don't know whether it has a dynamic initializer in its > defining translation unit, so we need to make conservative assumptions. For > a type that doesn't always get dynamic initialization, we make i_init a > weakref and only call it if it exists. In the same translation unit as the > definition, we optimize appropriately. > > The wrapper function is the function I'm talking about in > http://gcc.gnu.org/ml/gcc/2012-10/msg00024.html > > Any comments before I check this in? I wonder if an implementation is conforming that performs non-local TLS variable inits at thread creation time instead (probably also would require glibc support)? Or if we have the extra indirection via a reference anyway, we could have a pointer TLS variable (NULL initialized) that on the first access will trap where in a trap handler we could then perform initialization and setup of that pointer. Should we document our choice of implementation somewhere in the manual? And thus implicitely provide the hint that using non-local TLS variables with dynamic initialization comes with an abstraction penalty that we might not be easily able to remove? Thanks, Richard. > Jason
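A hedged sketch of the use side described above (identifier names are illustrative, and a plain weak declaration stands in for the weakref mechanism mentioned in the description): in a translation unit that only sees the extern declaration, every use of the variable goes through the wrapper, and the init function is called only if the defining translation unit actually emitted one.

extern thread_local int i;                     // from a header; may or may not have a
                                               // dynamic initializer in its defining TU
extern void i_init () __attribute__ ((weak));  // resolves to null if no init function exists

inline int &
i_wrapper ()
{
  if (i_init)      // conservative: only call it when the definition needed dynamic init
    i_init ();
  return i;
}

int
use ()
{
  return i_wrapper ();   // what a plain 'return i;' is rewritten to
}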
Re: Use conditional casting with symtab_node
On Thu, Oct 4, 2012 at 8:16 PM, Diego Novillo wrote: > On Thu, Oct 4, 2012 at 2:14 PM, Lawrence Crowl wrote: > >> So, Jan Hubicka requested and approved the current spelling. >> What now? > > I don't think we should hold this up. The names Jan requested seem > reasonable enough. We seem to be running in circles here. I suppose I have your promise that we'll release with consistent names. Please allocate some work hours on your side for the renaming of cgraph_node and varpool_node for the case Honza doesn't get to it in time. I see all these patches with mixed feeling - it puts breaks on all developers because they need to learn the new interface which does not bring any immediate benefit. So I think _your_ development time would be better spent by fixing open bugs or by tackling some of the existing scalability issues in GCC (rather than quoting funny '0.001% faster with 99% confidence' stuff). Thanks, Richard. > > Diego.
Re: patch to fix constant math
On Thu, Oct 4, 2012 at 9:27 PM, Richard Sandiford wrote: > Kenneth Zadeck writes: >> On 10/04/2012 12:58 PM, Richard Guenther wrote: >>> On Thu, Oct 4, 2012 at 3:55 PM, Kenneth Zadeck >>> wrote: >>>> Let me talk about the mode here first. >>>> >>>> What this interface/patch provides is a facility where the constant math >>>> that is done in optimizations is done exactly the way that it would be done >>>> on the target machine. What we have now is a compiler that only does this >>>> if it convenient to do on the host. I admit that i care about this more >>>> than others right now, but if intel adds a couple of more instructions to >>>> their vector units, other people will start to really care about this >>>> issue. >>>> If you take an OImode value with the current compiler and left shift it by >>>> 250 the middle end will say that the result is 0. This is just wrong!!! >>>> >>>> What this means is that the bitsize and precision of the operations need to >>>> be carried along when doing math. when wide-int checks for overflow on the >>>> multiply or add, it is not checking the if the value overflowed on two >>>> HWIs, >>>> it is checking if the add overflowed in the mode of the types that are >>>> represented on the target. When we do shift, we are not doing a shift >>>> within two HWIs, we are truncating the shift value (if this is appropriate) >>>> according to the bitsize and shifting according the precision. >>>> >>>> I think that an argument could be made that storing the mode should be >>>> changed to an explicit precision and bitsize. (A possible other option >>>> would be to store a tree type, but this would make the usage at the rtl >>>> level very cumbersome since types are rare.) Aside from the work, you would >>>> not get much push back. >>>> >>>> But the signess is a different argument. At the rtl level, the signess is >>>> a matter of context. (you could argue that this is a mistake and i would >>>> agree, but that is an even bigger change.) But more to the point, at the >>>> tree level, there are a surprising number of places where the operation >>>> desired does not follow the sign of the types that were used to construct >>>> the constants. Furthermore, not carrying the sign is more consistent with >>>> the double int code, which as you point out carries nothing. >>> Well, on RTL the signedness is on the operation (you have sdiv and udiv, >>> etc.). >> yes, there is a complete enough set of operations that allow you to >> specify the signess where this matters. >> >>> double-int tries to present a sign-less twos-complement entity of size >>> 2 * HOST_BITS_PER_WIDE_INT. I think that is sensible and for >>> obvious reasons should not change. Both tree and RTL rely on this. > > I disagree, at least from an RTL perspective. HOST_BITS_PER_WIDE_INT is > a host property, and as Kenny says, it's not really sensible to tie our > main target-specific IR to something host-specific. We've only done > that for historical reasons. Oh, I agree - that HOST_WIDE_INT provides the limit for the largest integer constants we can encode on a target is a bug! > On a target that's been converted to wide_int, I don't think a pair > of HWIs (whether or not it's expressed as a double_int) has any > significance at all at the RTL level. > > As far as the wide_ints recording a mode or precision goes: we're in > the "lucky" position of having tried both options. Trees record the > type (and thus precision) of all compile-time integer constants. > RTL doesn't. And the RTL way is a right pain. 
It leads to interfaces > in which rtx arguments often have to be accompanied by a mode. And it > leads to clunky differences like those between simplify_unary_operation > (which takes two mode arguments) and simplify_binary_operation > (which with our current set of binary operations requires only one). But the issue here is not that double_int doesn't record a mode or precision (or a sign). The issue is that CONST_DOUBLE and CONST_INT don't! The _tree_ INTEGER_CST contains of a type and a double-int. I see double-int (and the proposed wide-int) as a common building-block used for kind of "arbitrary precision" (as far as the target needs) integer constants on both tree and RTL. And having a common worker implementation requires it to operate on something different than tree types or RTL mo
Re: patch to fix constant math
On Fri, Oct 5, 2012 at 11:26 AM, Richard Guenther wrote: > Look at RTL users of the double-int routines and provide wrappers > that take RTXen as inputs. Enforce that all CONSTs have a mode. Which would, btw, allow us to "merge" CONST_INT, CONST_DOUBLE and CONST_WIDE by making the storage size variable and its length specified by the mode of the RTX (I never liked the distinction of CONST_INT and CONST_DOUBLE, and you are right, the CONST_DOUBLE paths are seldom exercised). Richard.
Re: patch to fix constant math
On Fri, Oct 5, 2012 at 11:55 AM, Richard Sandiford wrote: > Richard Guenther writes: >>> As far as the wide_ints recording a mode or precision goes: we're in >>> the "lucky" position of having tried both options. Trees record the >>> type (and thus precision) of all compile-time integer constants. >>> RTL doesn't. And the RTL way is a right pain. It leads to interfaces >>> in which rtx arguments often have to be accompanied by a mode. And it >>> leads to clunky differences like those between simplify_unary_operation >>> (which takes two mode arguments) and simplify_binary_operation >>> (which with our current set of binary operations requires only one). >> >> But the issue here is not that double_int doesn't record a mode >> or precision (or a sign). The issue is that CONST_DOUBLE and CONST_INT >> don't! The _tree_ INTEGER_CST contains of a type and a double-int. >> I see double-int (and the proposed wide-int) as a common building-block >> used for kind of "arbitrary precision" (as far as the target needs) integer >> constants on both tree and RTL. And having a common worker implementation >> requires it to operate on something different than tree types or RTL mode >> plus signedness. >> >>> To put it another way: every wide_int operation needs to know >>> the precision of its arguments and the precision of its result. >> >> That's not true. Every tree or RTL operation does, not every >> wide_int operation. double_int's are just twos-complement numbers >> of precision 2 * HOST_BITS_PER_WIDE_INT. wide_int's should >> be just twos-complement numbers of precision len * >> WHATEVER_HOST_COMPONENT_TYPE_IS_SUITABLE_FOR_A_FAST_IMPLEMENTATION. >> Operations on double_int and wide_int are bare-metal, >> in nearly all of the times you should use routines that do >> proper sign-/zero-extending to a possibly smaller precision. When >> using bare-metal you are supposed to do that yourself. >> >> Which is why I was suggesting to add double_sint, double_uint, >> double_sintp (with precision), etc., wrappers and operations. >> >>> Not storing the precision in the wide_int is putting the onus >>> on the caller to keep track of the precision separately. >> >> But that's a matter of providing something high-level ontop of >> the bare-metal double-int/wide-int (something shared between >> RTL and trees). Duplicating information in the bare-metal >> doesn't make sense (and no, INTEGER_CSTs on the tree level >> are _not_ short-lived, and how can a double-int make sense on >> the tree level when you say it doesn't make sense on the RTL level?) > > I think we're talking at cross purposes here. To the extent that > I'm not really sure where to begin. :-) Just in case this is it: > the idea is that wide_int is the type used to _process_ integers. > It is not suppoed to be the type used to store integers in the IRs. > The way trees store integers and the way that RTL stores integers > are up to them. For RTL the idea is that we continue to store integers > that happen to fit into a HWI as a CONST_INT (regardless of the size of > the CONST_INT's context-determined mode). Wider integers are stored > as a CONST_DOUBLE (for unconverted targets) or a CONST_WIDE_INT > (for converted targets). None of the three use the wide_int type; > they use more compact representations instead. And as Kenny says, > using CONST_INT continues to be an important way of reducing the > IR footprint. 
> > Whenever we want to do something interesting with the integers, > we construct a wide_int for them and use the same processing code > for all three rtx types. This avoids having the separate single-HWI > and double-HWI paths that we have now. It also copes naturally with > cases where we start out with single-HWI values but end up with wider > ones. > > But the operations that we do on these wide_ints will all be to a mode > or precision. Shifts of QImode integers are different from shifts of > HImode integers, etc. > > If you knew all that already (you probably did) and I've completely > missed the point, please say so. :-) > > I'm not sure what you mean by "bare metal". The issue is that unlike RTL where we "construct" double-ints from CONST_INT/CONST_DOUBLE right now, tree has the double-int _embedded_. That's why I say that the thing we embed in a CONST_WIDE_INT or a tree INTEGER_CST needs to be "bare metal", and that's what I would call wide-int. I think you have two things mixed in this patch which might
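To make the storage-versus-processing split concrete, here is a toy model (purely illustrative: the names, the fixed bound, and the presence of a precision field, which is the very point being debated, are not taken from the patch): constants are stored in a compact variable-length form, and a fixed-size working type is built from them only while arithmetic is being done.

#include <cstdint>
#include <vector>

/* Stored form: only as many elements as the value needs, analogous to keeping
   small values in CONST_INT and larger ones in CONST_WIDE_INT.  */
struct stored_const
{
  std::vector<int64_t> elems;
};

/* Working form: a fixed-size stack buffer big enough for the widest integer
   the target supports (cf. MAX_BITSIZE_MODE_ANY_INT).  */
struct working_int
{
  static const int max_elems = 4;   /* illustrative bound */
  int64_t val[max_elems];
  int len;
  int precision;                    /* precision of the mode the operation is done in */
};

/* Build the working form from the stored form; assumes the stored value fits.  */
static working_int
to_working (const stored_const &c, int precision)
{
  working_int w;
  w.len = (int) c.elems.size ();
  for (int j = 0; j < w.len; j++)
    w.val[j] = c.elems[j];
  w.precision = precision;
  return w;
}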
Re: [PATCH, gdc] - Merging gdc (GNU D Compiler) into gcc
On Fri, Oct 5, 2012 at 12:07 PM, Iain Buclaw wrote: > On 5 October 2012 01:06, Joseph S. Myers wrote: >> On Thu, 4 Oct 2012, Iain Buclaw wrote: >> >>> The only patches to gcc proper are documentation-related and adding >>> the D frontend / libphobos to configure and make files. I would have >>> thought that these would typically only be included with the actual >>> front-end? >> >> Looking back at my previous review comments, I suggested that you might >> need to split up c-common.[ch] so that certain parts of attribute handling >> could be shared with D, because duplicate code copied from elsewhere in >> GCC was not an appropriate implementation approach. Have you then >> eliminated the duplicate code in some other way that does not involve >> splitting up those files so code can be shared? >> > > Ah, no; thanks for reminding me of this. > > The code duplicated from c-common.[ch] are the handlers for C > __attributes__, however gdc doesn't use all of them because some just > don't have a fitting place eg: gnu_inline, artificial. > > Would the best approach be to move all handle_* functions and any > helper functions into a new source file that can be shared between > frontends, and define two new frontend hooks, > LANG_HOOK_ATTRIBUTE_TABLE and LANG_HOOK_FORMAT_ATTRIBUTE_TABLE ? Btw, the LTO frontend also has most of the stuff duplicated ... (see lto/lto-lang.c). Not sure why ... Richard. > > Regards > -- > Iain Buclaw > > *(p < e ? p++ : p) = (c & 0x0f) + '0';
Re: Use conditional casting with symtab_node
On Fri, Oct 5, 2012 at 12:50 PM, Nathan Froyd wrote: > - Original Message - >> I see all these patches with mixed feeling - it puts breaks on all >> developers >> because they need to learn the new interface which does not bring any >> immediate benefit. So I think _your_ development time would be >> better >> spent by fixing open bugs or by tackling some of the existing >> scalability >> issues in GCC (rather than quoting funny '0.001% faster with 99% >> confidence' >> stuff). > > This tone is unnecessary. Sorry, that wasn't intended. I question these numbers because unless you bootstrap say 100 times the noise in bootstrap speed is way too high to make such claims. Of course critical information is missing: "The new code bootstraps .616% faster with a 99% confidence of being faster." 99% confidence on what basis? What's your sample size? Why does the patch need this kind of "marketing"? > I, for one, think that it's excellent that Lawrence is writing these > cleanup patches and measuring what impact they have on performance. > Bonus points that they are making the compiler faster. Speed of the > compiler *is* a scalability issue, and it's one that GCC doesn't > appear to have paid all that much attention to over the years. I just don't believe the 0.5% numbers. Richard. > -Nathan
Re: patch to fix constant math
On Fri, Oct 5, 2012 at 1:24 PM, Richard Sandiford wrote: > Richard Guenther writes: >> On Fri, Oct 5, 2012 at 11:55 AM, Richard Sandiford >> wrote: >>> Richard Guenther writes: >>>>> As far as the wide_ints recording a mode or precision goes: we're in >>>>> the "lucky" position of having tried both options. Trees record the >>>>> type (and thus precision) of all compile-time integer constants. >>>>> RTL doesn't. And the RTL way is a right pain. It leads to interfaces >>>>> in which rtx arguments often have to be accompanied by a mode. And it >>>>> leads to clunky differences like those between simplify_unary_operation >>>>> (which takes two mode arguments) and simplify_binary_operation >>>>> (which with our current set of binary operations requires only one). >>>> >>>> But the issue here is not that double_int doesn't record a mode >>>> or precision (or a sign). The issue is that CONST_DOUBLE and CONST_INT >>>> don't! The _tree_ INTEGER_CST contains of a type and a double-int. >>>> I see double-int (and the proposed wide-int) as a common building-block >>>> used for kind of "arbitrary precision" (as far as the target needs) integer >>>> constants on both tree and RTL. And having a common worker implementation >>>> requires it to operate on something different than tree types or RTL mode >>>> plus signedness. >>>> >>>>> To put it another way: every wide_int operation needs to know >>>>> the precision of its arguments and the precision of its result. >>>> >>>> That's not true. Every tree or RTL operation does, not every >>>> wide_int operation. double_int's are just twos-complement numbers >>>> of precision 2 * HOST_BITS_PER_WIDE_INT. wide_int's should >>>> be just twos-complement numbers of precision len * >>>> WHATEVER_HOST_COMPONENT_TYPE_IS_SUITABLE_FOR_A_FAST_IMPLEMENTATION. >>>> Operations on double_int and wide_int are bare-metal, >>>> in nearly all of the times you should use routines that do >>>> proper sign-/zero-extending to a possibly smaller precision. When >>>> using bare-metal you are supposed to do that yourself. >>>> >>>> Which is why I was suggesting to add double_sint, double_uint, >>>> double_sintp (with precision), etc., wrappers and operations. >>>> >>>>> Not storing the precision in the wide_int is putting the onus >>>>> on the caller to keep track of the precision separately. >>>> >>>> But that's a matter of providing something high-level ontop of >>>> the bare-metal double-int/wide-int (something shared between >>>> RTL and trees). Duplicating information in the bare-metal >>>> doesn't make sense (and no, INTEGER_CSTs on the tree level >>>> are _not_ short-lived, and how can a double-int make sense on >>>> the tree level when you say it doesn't make sense on the RTL level?) >>> >>> I think we're talking at cross purposes here. To the extent that >>> I'm not really sure where to begin. :-) Just in case this is it: >>> the idea is that wide_int is the type used to _process_ integers. >>> It is not suppoed to be the type used to store integers in the IRs. >>> The way trees store integers and the way that RTL stores integers >>> are up to them. For RTL the idea is that we continue to store integers >>> that happen to fit into a HWI as a CONST_INT (regardless of the size of >>> the CONST_INT's context-determined mode). Wider integers are stored >>> as a CONST_DOUBLE (for unconverted targets) or a CONST_WIDE_INT >>> (for converted targets). None of the three use the wide_int type; >>> they use more compact representations instead. 
And as Kenny says, >>> using CONST_INT continues to be an important way of reducing the >>> IR footprint. >>> >>> Whenever we want to do something interesting with the integers, >>> we construct a wide_int for them and use the same processing code >>> for all three rtx types. This avoids having the separate single-HWI >>> and double-HWI paths that we have now. It also copes naturally with >>> cases where we start out with single-HWI values but end up with wider >>> ones. >>> >>> But the operations that we do on these wide_ints will all be to a mode >>> or precision. Shifts of
[PATCH] Fix PR54811
This fixes PR54811 for me - still Ada LTO bootstrap fails for me with cgraph verification ICEs. LTO bootstrapped and tested on x86_64-unknown-linux-gnu, applied. Richard. 2012-10-05 Richard Guenther PR middle-end/54811 * tree-ssa-live.c (clear_unused_block_pointer_1): Look at DECL_DEBUG_EXPR again. Index: gcc/tree-ssa-live.c === --- gcc/tree-ssa-live.c (revision 192114) +++ gcc/tree-ssa-live.c (working copy) @@ -621,6 +621,11 @@ clear_unused_block_pointer_1 (tree *tp, if (EXPR_P (*tp) && TREE_BLOCK (*tp) && !TREE_USED (TREE_BLOCK (*tp))) TREE_SET_BLOCK (*tp, NULL); + if (TREE_CODE (*tp) == VAR_DECL && DECL_DEBUG_EXPR_IS_FROM (*tp)) +{ + tree debug_expr = DECL_DEBUG_EXPR (*tp); + walk_tree (&debug_expr, clear_unused_block_pointer_1, NULL, NULL); +} return NULL_TREE; }
Re: [PATCH] Fix up DW_TAG_formal_parameter placement
On Thu, 4 Oct 2012, Jakub Jelinek wrote: > On Thu, Oct 04, 2012 at 09:42:59AM +0200, Richard Guenther wrote: > > This looks like the wrong place to fix things to me ... either we can > > fix this at the point we create the VAR_DECLs for the optimized away > > PARM_DECLs (or we should delay that until here?) > > No, that is not possible. There is no other block they could be added > to (they are added to DECL_INITIAL block), and they definitely need to > be added there, they are needed for the more common case where it is not > inlined. > And in that case it is the right location, for non-inlined function > DECL_INITIAL block's BLOCK_VARS is added directly as children of the > DW_TAG_subprogram. > > > or we fix it up > > in dwarf2out.c (how does this fix interact with stabs and the other > > debuginfo formats? > > We can't do that either, dwarf2out doesn't have information whether blocks > are really used (as in, any insns/stmts mention that block in > INSN_BLOCK/gimple_block) or not, it is only correct to move the VAR_DECLs > with PARM_DECL DECL_ORIGIN (i.e. DW_TAG_formal_parameter) up if the outer > BLOCK is not referenced by any insn/stmt (i.e. if the ranges of the > inner block with the VAR_DECL and outer block are exactly the same). > If the outer block has range that is superset of the inner block's range, > then the move would invalidly say that the DW_TAG_formal_parameter > is available somewhere where it is not supposed to be available. > > Initially I thought I'd do the moves in tree-ssa-live.c, in > remove_unused_scope_block_p it has information about what blocks are used > by any stmts and what are not. But it would be terribly expensive, > for each VAR_DECL in a block where its BLOCK_SUPERCONTEXT wasn't originally > TREE_USED before remove_unused_scope_block_p (and such blocks up to a > !TREE_USED inlined_function_outer_scope_p), it would need to search > all the BLOCK_SUPERCONTEXT BLOCK_VARS to see if the VAR_DECL isn't present > there as well, and only if not, move to the inlined_function_outer_scope_p > BLOCK. > > Doing it in tree-inline.c is IMHO the right spot, it is the place that > creates the extra artificial BLOCK around the remapped DECL_INITIAL block > and puts function arguments there. At that point we know for sure that > the DECL_INITIAL block has the same ranges as the whole inline function, > and it is desirable to move all arguments to the outer block, not just > those that were still present in DECL_ARGUMENTS during inlining. > > If you want to be more specific on what is to be moved, we could either > add some VAR_DECL flag bit (but that is expensive, we don't have many), > or perhaps just check that DECL_CONTEXT (DECL_ORIGIN (v)) == DECL_ORIGIN (fn) > (but is that ever false?). > > > mentioning DWARF in tree-inline looks odd, > > unless we get rid of the other formats - something I'd of course > > welcome ;)) > > That can be fixed, I can replace the DWARF terminology with something more > fuzzy. Ok, I think I know better what I was after now thinking about this more. initialize_inlined_parameters (id, stmt, fn, bb); if (DECL_INITIAL (fn)) -prepend_lexical_block (id->block, remap_blocks (DECL_INITIAL (fn), id)); the context shows it, initialize_inlined_parameters would be the natural place to "inline" optimized out parameters represented as VAR_DECLs with PARM_DECL DECL_ORIGIN. Of course the way remap_blocks works makes this difficult, if not impossible (or more ugly than your solution). So in the end your patch is ok as-is. Thanks, Richard.
Re: [PATCH] Improve debug info for partial inlining (PR debug/54519, take 2)
On Wed, 3 Oct 2012, Jakub Jelinek wrote: > On Tue, Sep 11, 2012 at 03:59:56PM +0200, Jakub Jelinek wrote: > > As discussed in the PR, right now we do a very bad job for debug info > > of partially inlined functions (both when they are kept only partially > > inlined, or when partial inlining is performed, but doesn't seem to be > > useful and foo.part.N is inlined back, either into the original function, or > > into a function into which foo has been inlined first). > > > > This patch improves that by doing something similar to what ipa-prop.c does, > > in particular for arguments that aren't actually passed to foo.part.N > > we add debug args and corresponding debug bind and debug source bind stmts > > to provide better debug info (if foo.part.N isn't inlined, then > > DW_OP_GNU_parameter_ref is going to be used together with corresponding call > > site arguments). > > > > Bootstrapped/regtested on x86_64-linux and i686-linux, some of the tests > > still fail with some option combinations, am going to file a DF VTA PR for > > that momentarily. Ok for trunk? > > Now that Alex fixed the DF VTA issue, most of the non-LTO FAILs are gone, > the remaining one on x86_64 looks like a GDB bug, and there is one -Os > failure on pr54519-3.c which could be either alias oracle or var-tracking > bug/missing feature - basically there is a non-addressable parameter > in stack slot, and vt_canon_true_dep -> canon_true_dependence thinks an > argument push insn might alias with it, because it doesn't have a MEM_EXPR > and ao_ref_from_mem fails. > > Posting an updated patch, because last night I found that even when the > patch should have fixed > static inline void foo (int x, int y) { asm volatile ("nop"); } > static inline void bar (int z) { foo (z, 0); foo (z, 1); } > int main () > { > bar (0); > bar (1); > return 0; > } > it didn't, there was a confusion when DECL_ORIGIN should be used and when > not. This version of the patch fixes that, on this testcase (adjusted > added as pr54519-6.c) p x, p y and up; p z now work well. > > Bootstrapped/regtested on x86_64-linux and i686-linux, ok for trunk? > > 2012-10-03 Jakub Jelinek > > PR debug/54519 > * ipa-split.c (split_function): Add debug args and > debug source and normal stmts for args_to_skip which are > gimple regs. > * tree-inline.c (copy_debug_stmt): When inlining, adjust > source debug bind stmts to debug binds of corresponding > DEBUG_EXPR_DECL. > > * gcc.dg/guality/pr54519-1.c: New test. > * gcc.dg/guality/pr54519-2.c: New test. > * gcc.dg/guality/pr54519-3.c: New test. > * gcc.dg/guality/pr54519-4.c: New test. > * gcc.dg/guality/pr54519-5.c: New test. > * gcc.dg/guality/pr54519-6.c: New test. > > --- gcc/ipa-split.c.jj2012-09-25 14:26:52.612821323 +0200 > +++ gcc/ipa-split.c 2012-10-02 17:41:31.030357922 +0200 > @@ -1059,6 +1059,7 @@ split_function (struct split_point *spli >gimple last_stmt = NULL; >unsigned int i; >tree arg, ddef; > + VEC(tree, gc) **debug_args = NULL; > >if (dump_file) > { > @@ -1232,6 +1233,65 @@ split_function (struct split_point *spli >gimple_set_block (call, DECL_INITIAL (current_function_decl)); >VEC_free (tree, heap, args_to_pass); > The following could use a comment on what you are doing ... 
> + if (args_to_skip) > +for (parm = DECL_ARGUMENTS (current_function_decl), num = 0; > + parm; parm = DECL_CHAIN (parm), num++) > + if (bitmap_bit_p (args_to_skip, num) > + && is_gimple_reg (parm)) > + { > + tree ddecl; > + gimple def_temp; > + > + arg = get_or_create_ssa_default_def (cfun, parm); > + if (!MAY_HAVE_DEBUG_STMTS) > + continue; > + if (debug_args == NULL) > + debug_args = decl_debug_args_insert (node->symbol.decl); > + ddecl = make_node (DEBUG_EXPR_DECL); > + DECL_ARTIFICIAL (ddecl) = 1; > + TREE_TYPE (ddecl) = TREE_TYPE (parm); > + DECL_MODE (ddecl) = DECL_MODE (parm); > + VEC_safe_push (tree, gc, *debug_args, DECL_ORIGIN (parm)); > + VEC_safe_push (tree, gc, *debug_args, ddecl); > + def_temp = gimple_build_debug_bind (ddecl, unshare_expr (arg), > + call); > + gsi_insert_after (&gsi, def_temp, GSI_NEW_STMT); > + } > + if (debug_args != NULL) > +{ > + unsigned int i; > + tree var, vexpr; > + gimple_stmt_iterator cgsi; > + gimple def_temp; > + > + push_cfun (DECL_STRUCT_FUNCTION (node->symbol.decl)); What do you need the push/pop_cfun for? I see ENTRY_BLOCK_PTR (easily fixable), but else? > + var = BLOCK_VARS (DECL_INITIAL (node->symbol.decl)); > + i = VEC_length (tree, *debug_args); > + cgsi = gsi_after_labels (single_succ (ENTRY_BLOCK_PTR)); > + do > + { > + i -= 2; > + while (var != NULL_TREE > + && DECL_ABSTRACT_ORIGIN (var) > + !=
Re: patch to fix constant math
On Fri, Oct 5, 2012 at 2:26 PM, Richard Sandiford wrote: > Richard Guenther writes: >> On Fri, Oct 5, 2012 at 1:24 PM, Richard Sandiford >> wrote: >>> Richard Guenther writes: >>>> The issue is that unlike RTL where we "construct" double-ints from >>>> CONST_INT/CONST_DOUBLE right now, tree has the double-int >>>> _embedded_. That's why I say that the thing we embed in >>>> a CONST_WIDE_INT or a tree INTEGER_CST needs to be >>>> "bare metal", and that's what I would call wide-int. >>> >>> OK, but that's deliberately not what Kenny's patch calls "wide int". >>> The whole idea is that the "bare metal" thing we embed in a >>> CONST_WIDE_INT or tree isn't (and doesn't need to be) the same >>> as the type that we use to operate on integers. That bare-metal >>> thing doesn't even have to have a fixed width. (CONST_WIDE_INT >>> doesn't have a fixed width, it's only as big as the integer >>> needs it to be.) >> >> Ok, let's rephrase it this way then: the "bare metal" thing used >> for the storage should ideally be the same in the tree IL and the RTL >> IL _and_ the higher-level abstract wide-int. > > Hmm, OK, that's a straight disagreement then. > >>>> So to me wide-ints provide the higher-level abstraction ontop of >>>> double-ints (which would remain the storage entity). >>>> >>>> Such higher-level abstraction is very useful, also for double-ints and >>>> also on the tree level. There is no requirement to provide bigger >>>> double-int (or wide-int) for this. Let's call this abstraction >>>> wide-int (as opposed to my suggested more piecemail double_sint, >>>> double_uint). You can perfectly model it ontop of the existing >>>> double_int storage. >>>> >>>> As of providing larger "double-ints" - there is not much code left >>>> (ok, quite an overstatement ;)) that relies on the implementation >>>> detail of double-int containing exactly two HOST_WIDE_INTs. >>>> The exact purpose of double-ints was to _hide_ this (previously >>>> we only had routines like mul_double_with_sign which take >>>> two HOST_WIDE_INT components). Most code dealing with >>>> the implementation detail is code interfacing with middle-end >>>> routines that take a HOST_WIDE_INT argument (thus the >>>> double_int_fits_hwi_p predicates) - even wide_int has to support >>>> this kind of interfacing. >>>> >>>> So, after introducing wide_int that just wraps double_int and >>>> changing all user code (hopefully all, I guess mostly all), we'd >>>> tackle the issue that the size of double_int's is host dependent. >>>> A simple solution is to make it's size dependent on a target macro >>>> (yes, macro ...), so on a 32bit HWI host targeting a 64bit 'HWI' target >>>> you'd simply have four HWIs in the 'double-int' storage (and >>>> arithmetic needs to be generalized to support this). >>> >>> I just don't see why this is a good thing. The constraints >>> are different when storing integers and when operating on them. >>> When operating on them, we want something that is easily constructed >>> on the stack, so we can create temporary structures very cheaply, >>> and can easily pass and return them. We happen to know at GCC's >>> compile time how big the biggest integer will ever be, so it makes >>> sense for wide_int to be that wide. >> >> I'm not arguing against this. I'm just saying that the internal >> representation will depend on the host - not the number of total >> bits, but the number of pieces. > > Sure, but Kenny already has a macro to specify how many bits we need > (MAX_BITSIZE_MODE_ANY_INT). 
We can certainly wrap: > > HOST_WIDE_INT val[MAX_BITSIZE_MODE_ANY_INT / HOST_BITS_PER_WIDE_INT]; > > in a typedef if that's what you prefer. I'd prefer it to be initially double_int, and later "fixed" to double_int with a member like the above. Possibly renamed as well. >>> But when storing integers we want something compact. If your >>> machine supports 256-bit integers, but the code you're compiling >>> makes heavy use of 128-bit integers, why would you want to waste >>> 128 of 256 bits on every stored integer? Which is why even >>> CONST_WIDE_INT doesn
Re: patch to fix constant math
On Fri, Oct 5, 2012 at 2:39 PM, Richard Guenther wrote: > > Ok, I see where you are going. Let me look at the patch again. * The introduction and use of CONST_SCALAR_INT_P could be split out (obvious and good) * DEF_RTL_EXPR(CONST_WIDE_INT, "const_wide_int", "", RTX_CONST_OBJ) defining that new RTX is orthogonal to providing the wide_int abstraction for operating on CONST_INT and CONST_DOUBLE, right? @@ -144,6 +144,7 @@ static bool prefer_and_bit_test (enum machine_mode mode, int bitnum) { bool speed_p; + wide_int mask = wide_int::set_bit_in_zero (bitnum, mode); set_bit_in_zero ... why use this instead of wide_int::zero (mode).set_bit (bitnum) that would match the old double_int_zero.set_bit interface and be more composed of primitives? if (and_test == 0) { @@ -164,8 +165,7 @@ prefer_and_bit_test (enum machine_mode m } /* Fill in the integers. */ - XEXP (and_test, 1) -= immed_double_int_const (double_int_zero.set_bit (bitnum), mode); + XEXP (and_test, 1) = immed_wide_int_const (mask); I suppose immed_wide_int_const may create a CONST_INT, a CONST_DOUBLE or a CONST_WIDE_INT? +#if TARGET_SUPPORTS_WIDE_INT +/* Returns 1 if OP is an operand that is a CONST_INT or CONST_WIDE_INT. */ +int +const_scalar_int_operand (rtx op, enum machine_mode mode) +{ why is it necessary to condition anything on TARGET_SUPPORTS_WIDE_INT? It seems testing/compile coverage of this code will be bad ... Can't you simply map CONST_WIDE_INT to CONST_DOUBLE for targets not supporting CONST_WIDE_INT (what does it mean to "support" CONST_WIDE_INT? Does a target that supports CONST_WIDE_INT no longer use CONST_DOUBLEs for integers?) + if (!CONST_WIDE_INT_P (op)) +return 0; hm, that doesn't match the comment (CONST_INT is supposed to return 1 as well). The comment doesn't really tell what the function does it seems, + int prec = GET_MODE_PRECISION (mode); + int bitsize = GET_MODE_BITSIZE (mode); + + if (CONST_WIDE_INT_NUNITS (op) * HOST_BITS_PER_WIDE_INT > bitsize) + return 0; it's mode argument is not documented. And this doesn't even try to see if the constant would fit the mode anyway (like if it's zero). Not sure what the function is for. + { + /* Multiword partial int. */ + HOST_WIDE_INT x + = CONST_WIDE_INT_ELT (op, CONST_WIDE_INT_NUNITS (op) - 1); + return (wide_int::sext (x, prec & (HOST_BITS_PER_WIDE_INT - 1)) + == x); err - so now we have wide_int with a mode but we pass in another mode and see if they have the same value? Same value as-in signed value? Which probably is why you need to rule out different size constants above? Because you don't know whether to sign or zero extend? +/* Returns 1 if OP is an operand that is a CONST_WIDE_INT. */ +int +const_wide_int_operand (rtx op, enum machine_mode mode) +{ similar comments, common code should be factored out at least. Instead of conditioning a whole set of new function on supports-wide-int please add cases to the existing functions to avoid diverging in pieces that both kind of targets support. @@ -2599,7 +2678,7 @@ constrain_operands (int strict) break; case 'n': - if (CONST_INT_P (op) || CONST_DOUBLE_AS_INT_P (op)) + if (CONST_SCALAR_INT_P (op)) win = 1; factoring out this changes would really make the patch less noisy. skipping to rtl.h bits +struct GTY((variable_size)) hwivec_def { + int num_elem;/* number of elements */ + HOST_WIDE_INT elem[1]; +}; num_elem seems redundant and computable from GET_MODE_SIZE. Or do you want to optimize encoding like for CONST_INT (and unlike CONST_DOUBLE)? I doubt the above packs nicely into rtx_def? 
A general issue of it though - we waste 32bits on 64bit hosts in rtx_def between the bits and the union. Perfect place for num_elem (if you really need it) and avoids another 4 byte hole. Similar issue exists in rtvec_def. back to where I was @@ -645,6 +673,10 @@ iterative_hash_rtx (const_rtx x, hashval return iterative_hash_object (i, hash); case CONST_INT: return iterative_hash_object (INTVAL (x), hash); +case CONST_WIDE_INT: + for (i = 0; i < CONST_WIDE_INT_NUNITS (x); i++) + hash = iterative_hash_object (CONST_WIDE_INT_ELT (x, i), hash); + return hash; I see CONST_DOUBLEs value is not hashed. Why hash wide-ints value? Seeing +#define HWI_GET_NUM_ELEM(HWIVEC) ((HWIVEC)->num_elem) +#define HWI_PUT_NUM_ELEM(HWIVEC, NUM) ((HWIVEC)->num_elem = (NUM)) skipping to +#if TARGET_SUPPORTS_WIDE_INT + { +rtx value = const_wide_int_alloc (len); +unsigned int i; + +/* It is so tempting to just put the mode in here. Must control + myself ... */ +PUT_MODE (value, VOIDmode); +HWI_PUT_NUM_ELEM (CONST_WIDE_INT_VEC (value), len); why is
Re: patch to fix constant math
On Fri, Oct 5, 2012 at 3:18 PM, Richard Sandiford wrote: > Richard Sandiford writes: > How is CONST_WIDE_INT variable size? It's just the usual trailing variable-length array thing. >>> >>> Good. Do you get rid of CONST_DOUBLE (for integers) at the same time? >> >> Yeah. I initially thought it might be OK to keep them and have >> CONST_INT, integer CONST_DOUBLEs and CONST_WIDE_INT alongside >> each other. (The way the patch is structured means that the >> choice of whether to keep integer CONST_DOUBLEs can be changed >> very easily.) But Kenny convinced me it was a bad idea. > > Sorry to follow up on myself, but to clarify: I was talking about > converted targets here. (As in, I originally thought even converted > targets could continue to use integer CONST_DOUBLEs.) > > Unconverted targets continue to use CONST_DOUBLE. Why is it that not all targets are "converted"? What's the difficulty with that? I really do not like partially transitioning there. Richard. > Richard
Re: [PATCH] Improve debug info for partial inlining (PR debug/54519, take 2)
On Fri, Oct 5, 2012 at 2:49 PM, Jakub Jelinek wrote: > On Fri, Oct 05, 2012 at 02:20:13PM +0200, Richard Guenther wrote: >> The following could use a comment on what you are doing ... > > Will add something. > >> > + if (args_to_skip) >> > +for (parm = DECL_ARGUMENTS (current_function_decl), num = 0; >> > +parm; parm = DECL_CHAIN (parm), num++) >> > + if (bitmap_bit_p (args_to_skip, num) >> > + && is_gimple_reg (parm)) >> > + { >> > + tree ddecl; >> > + gimple def_temp; >> > + >> > + arg = get_or_create_ssa_default_def (cfun, parm); >> > + if (!MAY_HAVE_DEBUG_STMTS) >> > + continue; >> > + if (debug_args == NULL) >> > + debug_args = decl_debug_args_insert (node->symbol.decl); >> > + ddecl = make_node (DEBUG_EXPR_DECL); >> > + DECL_ARTIFICIAL (ddecl) = 1; >> > + TREE_TYPE (ddecl) = TREE_TYPE (parm); >> > + DECL_MODE (ddecl) = DECL_MODE (parm); >> > + VEC_safe_push (tree, gc, *debug_args, DECL_ORIGIN (parm)); >> > + VEC_safe_push (tree, gc, *debug_args, ddecl); >> > + def_temp = gimple_build_debug_bind (ddecl, unshare_expr (arg), >> > + call); >> > + gsi_insert_after (&gsi, def_temp, GSI_NEW_STMT); >> > + } >> > + if (debug_args != NULL) >> > +{ >> > + unsigned int i; >> > + tree var, vexpr; >> > + gimple_stmt_iterator cgsi; >> > + gimple def_temp; >> > + >> > + push_cfun (DECL_STRUCT_FUNCTION (node->symbol.decl)); >> >> What do you need the push/pop_cfun for? I see ENTRY_BLOCK_PTR >> (easily fixable), but else? > > I believe that gsi_insert_before in another function > isn't going to work well. > E.g. update_modified_stmt starts with > if (!ssa_operands_active (cfun)) > return; Hmm, indeed. > Or is it ok to use gsi_insert_before_without_update and expect that > when compilation resumes for the modified callee, it will > update all modified stmts? There is not much that needs > to be updated on the stmts, source bind has not setters nor uses > of any SSA_NAMEs or virtual defs/sets, the normal bind has a DEBUG_EXPR_DECL > on the RHS and so is not a use or def of anything as well. I don't think we want to rely on that ... so just keep the push/pop_cfun. Thanks, Richard. > If push_cfun/pop_cfun is not needed, guess the two loops could be merged, > the reason for two of them was just that I didn't want to push_cfun/pop_cfun > for each optimized away parameter. >> >> > + var = BLOCK_VARS (DECL_INITIAL (node->symbol.decl)); >> > + i = VEC_length (tree, *debug_args); >> > + cgsi = gsi_after_labels (single_succ (ENTRY_BLOCK_PTR)); >> > + do >> > + { >> > + i -= 2; >> > + while (var != NULL_TREE >> > +&& DECL_ABSTRACT_ORIGIN (var) >> > + != VEC_index (tree, *debug_args, i)) >> > + var = TREE_CHAIN (var); >> > + if (var == NULL_TREE) >> > + break; >> > + vexpr = make_node (DEBUG_EXPR_DECL); >> > + parm = VEC_index (tree, *debug_args, i); >> > + DECL_ARTIFICIAL (vexpr) = 1; >> > + TREE_TYPE (vexpr) = TREE_TYPE (parm); >> > + DECL_MODE (vexpr) = DECL_MODE (parm); >> > + def_temp = gimple_build_debug_source_bind (vexpr, parm, >> > +NULL); >> > + gsi_insert_before (&cgsi, def_temp, GSI_SAME_STMT); >> > + def_temp = gimple_build_debug_bind (var, vexpr, NULL); >> > + gsi_insert_before (&cgsi, def_temp, GSI_SAME_STMT); >> > + } >> > + while (i); >> > + pop_cfun (); >> > +} >> > + >> > + t = VEC_index (tree, *debug_args, i + 1); >> > + gimple_debug_source_bind_set_value (stmt, t); >> > + stmt->gsbase.subcode = GIMPLE_DEBUG_BIND; >> >> That looks fishy ... shouldn't it be at least the other way around? >> >> stmt->gsbase.subcode = GIMPLE_DEBUG_BIND; >> gimple_debug_bind_set_value (stmt, t); > > It surely can. 
I was considering whether to create a new wrapper > in gimple.h for the subcode change of a debug stmt, but if there are > no further places needing this (don't see them right now), that might be > overkill. > >> Otherwise this looks ok. > > Thanks. > > Jakub
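For reference, the cross-function insertion pattern that survives the review is essentially the following sketch (callee_decl stands for node->symbol.decl; def_temp is the debug bind statement built beforehand):

  /* Sketch: switch cfun to the callee so gsi_insert_before can update
     SSA operands in the right function, then switch back.  */
  gimple_stmt_iterator cgsi;

  push_cfun (DECL_STRUCT_FUNCTION (callee_decl));
  cgsi = gsi_after_labels (single_succ (ENTRY_BLOCK_PTR));
  gsi_insert_before (&cgsi, def_temp, GSI_SAME_STMT);
  pop_cfun ();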
Re: patch to fix constant math
On Fri, Oct 5, 2012 at 4:14 PM, Richard Sandiford wrote: > Richard Guenther writes: >> On Fri, Oct 5, 2012 at 3:18 PM, Richard Sandiford >> wrote: >>> Richard Sandiford writes: >>>>>>> How is CONST_WIDE_INT variable size? >>>>>> >>>>>> It's just the usual trailing variable-length array thing. >>>>> >>>>> Good. Do you get rid of CONST_DOUBLE (for integers) at the same time? >>>> >>>> Yeah. I initially thought it might be OK to keep them and have >>>> CONST_INT, integer CONST_DOUBLEs and CONST_WIDE_INT alongside >>>> each other. (The way the patch is structured means that the >>>> choice of whether to keep integer CONST_DOUBLEs can be changed >>>> very easily.) But Kenny convinced me it was a bad idea. >>> >>> Sorry to follow up on myself, but to clarify: I was talking about >>> converted targets here. (As in, I originally thought even converted >>> targets could continue to use integer CONST_DOUBLEs.) >>> >>> Unconverted targets continue to use CONST_DOUBLE. >> >> Why is it that not all targets are "converted"? What's the difficulty >> with that? >> I really do not like partially transitioning there. > > The problem is that CONST_DOUBLE as it exists today has two meanings: > a floating-point meaning and an integer meaning. Ports that handle > CONST_DOUBLEs are aware of this and expect the two things to have > the same rtx code. Whereas in a converted port, the two things > have different rtx codes, and the integers have a different > representation from the current low/high pair. > > So to take the first hit in config/alpha/alpha.c, > namely alpha_rtx_costs: > > case CONST_DOUBLE: > if (x == CONST0_RTX (mode)) > *total = 0; > else if ((outer_code == PLUS && add_operand (x, VOIDmode)) >|| (outer_code == AND && and_operand (x, VOIDmode))) > *total = 0; > else if (add_operand (x, VOIDmode) || and_operand (x, VOIDmode)) > *total = 2; > else > *total = COSTS_N_INSNS (2); > return true; > > What could the patch do to make this work without modification? > The middle two cases are for integers, but the first and last > presumably apply to both integers and floats. I didn't say it does not require changes, just that the transition should be finished. Some ports have little CONST_DOUBLE uses (which is what I am grepping for), and if the max-wide-int-size matches that of the current CONST_DOUBLE there is little chance for code generation differences (in fact there should be none, correct?). In the current patch no target is converted (maybe that's just going to be 5/n?), I'd like to see at least one commonly tested target converted and if we don't convert all targets another commonly tested target to stay unconverted (just check gcc-testresults for which pair of targets that would work). Definitely at the end of stage3 we should have zero unconverted targets (but I doubt converting them is a huge task - I have converted the VECTOR_CST representation as well and we've had the location + BLOCK merge and other changes affecting all targets). Richard. > Richard
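To make the kind of per-port change concrete, a converted alpha_rtx_costs would presumably split the case roughly as follows. This is a hypothetical sketch, not the actual conversion:

    case CONST_WIDE_INT:
      /* Integer constants only; zero always fits a CONST_INT, so no
         CONST0_RTX check is needed here.  */
      if ((outer_code == PLUS && add_operand (x, VOIDmode))
          || (outer_code == AND && and_operand (x, VOIDmode)))
        *total = 0;
      else if (add_operand (x, VOIDmode) || and_operand (x, VOIDmode))
        *total = 2;
      else
        *total = COSTS_N_INSNS (2);
      return true;

    case CONST_DOUBLE:
      /* Floating point only after the conversion.  */
      if (x == CONST0_RTX (mode))
        *total = 0;
      else
        *total = COSTS_N_INSNS (2);
      return true;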
Re: patch to fix constant math
On Fri, Oct 5, 2012 at 6:34 PM, Kenneth Zadeck wrote: > richard s, > there are two comments that i deferred to you. that have the word richard > in them, > > richi, > thank, i will start doing this now. > > kenny > > On 10/05/2012 09:49 AM, Richard Guenther wrote: >> >> On Fri, Oct 5, 2012 at 2:39 PM, Richard Guenther >> wrote: >>> >>> Ok, I see where you are going. Let me look at the patch again. >> >> * The introduction and use of CONST_SCALAR_INT_P could be split out >>(obvious and good) >> >> * DEF_RTL_EXPR(CONST_WIDE_INT, "const_wide_int", "", RTX_CONST_OBJ) >>defining that new RTX is orthogonal to providing the wide_int >> abstraction >>for operating on CONST_INT and CONST_DOUBLE, right? >> >> @@ -144,6 +144,7 @@ static bool >> prefer_and_bit_test (enum machine_mode mode, int bitnum) >> { >> bool speed_p; >> + wide_int mask = wide_int::set_bit_in_zero (bitnum, mode); >> >> set_bit_in_zero ... why use this instead of >> wide_int::zero (mode).set_bit (bitnum) that would match the old >> double_int_zero.set_bit interface and be more composed of primitives? > > adding something like this was just based on usage.We do this operation > all over the place, why not make it concise and efficient. The api is > very bottom up. I looked at what the clients were doing all over the place > and added those functions. > > wide-int has both and_not and or_not. double-int only has and_not, but > there are a lot of places in where you do or_nots, so we have or_not also. > > >> if (and_test == 0) >> { >> @@ -164,8 +165,7 @@ prefer_and_bit_test (enum machine_mode m >> } >> >> /* Fill in the integers. */ >> - XEXP (and_test, 1) >> -= immed_double_int_const (double_int_zero.set_bit (bitnum), mode); >> + XEXP (and_test, 1) = immed_wide_int_const (mask); >> >> I suppose immed_wide_int_const may create a CONST_INT, a CONST_DOUBLE >> or a CONST_WIDE_INT? > > yes, on converted targets it builds const_ints and const_wide_ints and on > unconverted targets it builds const_ints and const_doubles.The reason i > did not want to convert the targets is that the code that lives in the > targets generally wants to use the values to create constants and this code > is very very very target specific. > >> >> +#if TARGET_SUPPORTS_WIDE_INT >> +/* Returns 1 if OP is an operand that is a CONST_INT or CONST_WIDE_INT. >> */ >> +int >> +const_scalar_int_operand (rtx op, enum machine_mode mode) >> +{ >> >> why is it necessary to condition anything on TARGET_SUPPORTS_WIDE_INT? >> It seems testing/compile coverage of this code will be bad ... >> >> Can't you simply map CONST_WIDE_INT to CONST_DOUBLE for targets > > The accessors for the two are completely different.const doubles "know" > that there are exactly two hwi's. for const_wide_ints, you only know that > the len is greater than 1. anything with len 1 would be CONST_INT. > > In a modern c++ world, const_int would be a subtype of const_int, but that > is a huge patch. > > >> not supporting CONST_WIDE_INT (what does it mean to "support" >> CONST_WIDE_INT? Does a target that supports CONST_WIDE_INT no >> longer use CONST_DOUBLEs for integers?) > > yes, that is exactly what it means. > >> + if (!CONST_WIDE_INT_P (op)) >> +return 0; >> >> hm, that doesn't match the comment (CONST_INT is supposed to return 1 as >> well). >> The comment doesn't really tell what the function does it seems, > > I need some context here to reply. 
> > >> + int prec = GET_MODE_PRECISION (mode); >> + int bitsize = GET_MODE_BITSIZE (mode); >> + >> + if (CONST_WIDE_INT_NUNITS (op) * HOST_BITS_PER_WIDE_INT > bitsize) >> + return 0; >> >> it's mode argument is not documented. And this doesn't even try to see if >> the constant would fit the mode anyway (like if it's zero). Not sure what >> the function is for. > > I will upgrade the comments, they were inherited from the old version with > the const_double changed to the const_wide_int > > >> + { >> + /* Multiword partial int. */ >> + HOST_WIDE_INT x >> + = CONST_WIDE_INT_ELT (op, CONST_WIDE_INT_NUNITS (op) - 1); >> + return (wide_int::sext (x, prec & (HOST_BITS_PER_WIDE_INT - 1)) >> + == x); >> >> err - so now we have wide_int with a mode but we pass in ano
Re: patch to fix constant math
On Sun, Oct 7, 2012 at 3:11 PM, Kenneth Zadeck wrote: > > On 10/07/2012 08:47 AM, Richard Guenther wrote: >>>> >>>> len seems redundant unless you want to optimize encoding. >>>> >>len == (precision + HOST_BITS_PER_WIDE_INT - 1) / >>>> >> HOST_BITS_PER_WIDE_INT. >>> >>> > >>> >that is exactly what we do. However, we are optimizing time, not >>> > space. >>> >the value is always stored in compressed form, i.e the same >>> > representation >>> >as is used inside the CONST_WIDE_INTs and INT_CSTs. this makes the >>> >transformation between them very fast. And since we do this a lot, it >>> >needs to be fast. So the len is the number of HWIs needed to represent >>> > the >>> >value which is typically less than what would be implied by the >>> > precision. >> >> But doesn't this require a sign? Otherwise how do you encode TImode >> 0x? >> Would len be 1 here? (and talking about CONST_WIDE_INT vs CONST_INT, >> wouldn't CONST_INT be used anyway for all ints we can encode "small"? Or >> is it (hopefully) the goal to replace CONST_INT with CONST_WIDE_INT >> everywhere?) Or do you say wide_int is always "signed', thus TImode >> 0x >> needs len == 2 and an explicit zero upper HWI word? > > the compression of this has len ==2 with the explicit upper being 0. > > >> Or do you say wide_int >> is always "unsigned", thus TImode -1 needs len == 2? Noting that >> double-ints >> were supposed to be twos-complement (thus, 'unsigned') numbers having >> implicit non-zero bits sounds error-prone. >> >> That said - I don't see any value in optimizing storage for short-lived >> (as you say) wide-ints (apart from limiting it to the mode size). For >> example mutating operations on wide-ints are not really possible >> if you need to extend storage. > > the compression used is independent of whether it is sign or unsigned. but > the compression "looks" signed. i.e. you do not represent the upper hwi if > they would just be a sign extension of the top hwi that is represented. > However, this does not imply anything about the math that was used to set > those bits that way, and you can always go back to the full rep independent > of the sign. > > I do not have any long term plans to merge CONST_INT into CONST_WIDE_INT. > It would be a huge change in the ports and would be fairly space > inefficient. My guess is that in real code, less than 1% of real constants > will have a length greater than 1 even on a 32 bit host. CONST_INT is very > space efficient. This could have been mentioned as part of Richard's > response to your comment about the way we represent the CONST_WIDE_INTs. > In practice, there are almost none of them anyway. > > In fact, you could argue that the tree level did it wrong (not that i am > suggesting to change this). But it makes me think what was going on when > the decision to make TYPE_PRECISION be an INT_CST rather than just a HWI was > made. For that there is an implication that it could never take more than > a HWI since no place in the code even checks TREE_INT_CST_HIGH for these. Well - on the tree level we now always have two HWIs for all INTEGER_CSTs. If we can, based on the size of the underlying mode, reduce that to one HWI we already win something. If we add an explicit length to allow a smaller encoding for larger modes (tree_base conveniently has an available 'int' for this ...) then we'd win in more cases. Thus, is CONST_INT really necessarily better than optimized CONST_WIDE_INT storage? Richard. 
>>>> >>+ enum Op { >>>> >>+NONE, >>>> >> >>>> >>we don't have sth like this in double-int.h, why was the double-int >>>> >>mechanism >>>> >>not viable? >>> >>> > >>> >i have chosen to use enums for things rather than passing booleans. >> >> But it's bad to use an enum with 4 values for a thing that can only >> possibly >> expect two. You cannot optimize tests as easily. Please retain the bool >> uns >> parameters instead. > > I am breaking it into two enums.
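A minimal sketch of the length compression described above (the function name is hypothetical): the top HOST_WIDE_INTs are dropped whenever they are merely the sign extension of the element below, independently of how the value will later be interpreted:

  /* Hypothetical sketch: number of HOST_WIDE_INTs needed to represent
     VAL in the compressed, sign-extension-based form.  */
  static int
  canonical_len (const HOST_WIDE_INT *val, int len)
  {
    while (len > 1
           && val[len - 1] == (val[len - 2] < 0 ? (HOST_WIDE_INT) -1 : 0))
      len--;
    return len;
  }

So a TImode constant whose low 64 bits are all ones and whose upper bits are zero keeps len == 2 with an explicit zero upper word, while TImode -1 compresses to len == 1, matching the examples above.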
Re: Fixup INTEGER_CST
On Sun, Oct 7, 2012 at 5:15 PM, Jan Hubicka wrote: > Hi, > I added a santy check that after fixup all types that lost in the merging are > really dead. And it turns out we have some zombies around. > > INTEGER_CST needs special care because it is special cased by the streamer. > We also > do not want to do inplace modificaitons on it because that would corrupt the > hashtable > used by tree.c's sharing code > > Bootstrapped/regtested x86_64-linux, OK? No, I don't think we want to fixup INTEGER_CSTs this way. Instead we want to fixup them where they end up used unfixed. Richard. > > PR lto/54839 > * lto/lto.c (remember_with_vars): Also fixup INTEGER_CST. > (fixup_integer_cst): New functoin. > (lto_ft_type): Fixup BASETYPE of methods and offsets. > Index: lto/lto.c > === > --- lto/lto.c (revision 192164) > +++ lto/lto.c (working copy) > @@ -1408,11 +1408,36 @@ remember_with_vars (tree t) > (tt) = GIMPLE_REGISTER_TYPE (tt); \ > if (VAR_OR_FUNCTION_DECL_P (tt) && TREE_PUBLIC (tt)) \ > remember_with_vars (t); \ > + if (TREE_CODE (tt) == INTEGER_CST) \ > + (tt) = fixup_integer_cst (tt); \ > } \ > } while (0) > > static void lto_fixup_types (tree); > > +/* Return integer_cst T with updated type. */ > + > +static tree > +fixup_integer_cst (tree t) > +{ > + tree type = GIMPLE_REGISTER_TYPE (TREE_TYPE (t)); > + > + if (type == TREE_TYPE (t)) > +return t; > + > + /* If overflow was set, streamer_read_integer_cst > + produced local copy of T. */ > + if (TREE_OVERFLOW (t)) > +{ > + TREE_TYPE (t) = type; > + return t; > +} > + else > + /* Otherwise produce new shared node for the new type. */ > +return build_int_cst_wide (type, TREE_INT_CST_LOW (t), > + TREE_INT_CST_HIGH (t)); > +} > + > /* Fix up fields of a tree_typed T. */ > > static void > @@ -1526,6 +1549,11 @@ lto_ft_type (tree t) >LTO_FIXUP_TREE (t->type_non_common.binfo); > >LTO_FIXUP_TREE (TYPE_CONTEXT (t)); > + > + if (TREE_CODE (t) == METHOD_TYPE) > +TYPE_METHOD_BASETYPE (t); > + if (TREE_CODE (t) == OFFSET_TYPE) > +TYPE_OFFSET_BASETYPE (t); > } > > /* Fix up fields of a BINFO T. */
Re: Fixup INTEGER_CST
On Sun, Oct 7, 2012 at 7:22 PM, Jan Hubicka wrote: >> On Sun, Oct 7, 2012 at 5:15 PM, Jan Hubicka wrote: >> > Hi, >> > I added a santy check that after fixup all types that lost in the merging >> > are >> > really dead. And it turns out we have some zombies around. >> > >> > INTEGER_CST needs special care because it is special cased by the >> > streamer. We also >> > do not want to do inplace modificaitons on it because that would corrupt >> > the hashtable >> > used by tree.c's sharing code >> > >> > Bootstrapped/regtested x86_64-linux, OK? >> >> No, I don't think we want to fixup INTEGER_CSTs this way. Instead we >> want to fixup >> them where they end up used unfixed. > > Erm, I think it is what the patch does? Ah, indeed. > It replaces pointers to integer_cst with type that did not survive by pointer > to new integer_cst. (with the optimization that INTEGER_CST with overflow > is changed in place because it is allowed to do so). Btw ... >> > @@ -1526,6 +1549,11 @@ lto_ft_type (tree t) >> >LTO_FIXUP_TREE (t->type_non_common.binfo); >> > >> >LTO_FIXUP_TREE (TYPE_CONTEXT (t)); >> > + >> > + if (TREE_CODE (t) == METHOD_TYPE) >> > +TYPE_METHOD_BASETYPE (t); >> > + if (TREE_CODE (t) == OFFSET_TYPE) >> > +TYPE_OFFSET_BASETYPE (t); that looks like a no-op to me ... (both are TYPE_MAXVAL which is already fixed up). Thus, ok with removing the above hunk. Thanks, Richard. >> > } >> > >> > /* Fix up fields of a BINFO T. */
Re: handle isl and cloog in contrib/download_prerequisites
On Mon, Oct 8, 2012 at 3:16 AM, Jonathan Wakely wrote: > On 7 October 2012 21:31, Manuel López-Ibáñez wrote: >> On 7 October 2012 22:13, Jonathan Wakely wrote: >>> >>> On Oct 7, 2012 12:00 AM, "NightStrike" wrote: On Sat, Oct 6, 2012 at 7:30 AM, Manuel López-Ibáñez wrote: > Hi, > > GCC now requires ISL and a very new CLOOG but download_prerequisites > does not download those. Also, there is only one sensible place to As of what version is isl/cloog no longer optional? >>> >>> If they're really no longer optional then the prerequisites page and 4.8 >>> changes page need to be updated. >>> >>> The patch downloads isl and cloog unconditionally, does gcc build them >>> unconditionally if they're found in the source dir? If they are still >>> optional I don't want download_prerequisites to fetch files that will slow >>> down building gcc by building libs and enabling features I don't use. >> >> I guess they are optional in the sense that you can configure gcc to >> not require them. But the default configure in x86_64-gnu-linux >> requires isl and cloog. > > Are you sure? > > Seems to me the default is still the same as it always has been, i.e. > Graphite optimisations can be enabled if ISL and cloog are present, > but they're not "required". I can bootstrap without ISL anyway. If good enough ISL and cloog are not found graphite is simply disabled unless you explicitely enabled it via specifying either of ISL or cloog configury. Richard.
Re: [lra] patch to speed more compilation of PR54146
On Sun, Oct 7, 2012 at 11:27 PM, Steven Bosscher wrote: > On Sun, Oct 7, 2012 at 5:59 PM, Vladimir Makarov wrote: >> The following patch speeds LRA up more on PR54146. Below times for >> compilation of the test on gcc17.fsffrance.org (an AMD machine): >> >> Before: >> real=1214.71 user=1192.05 system=22.48 >> After: >> real=1144.37 user=1124.31 system=20.11 > > Hi Vlad, > > The next bottle-neck in my timings is in > lra-eliminate.c:lra_eliminate(), in this loop: > >FOR_EACH_BB (bb) > FOR_BB_INSNS_SAFE (bb, insn, temp) >{ >if (bitmap_bit_p (&insns_with_changed_offsets, INSN_UID (insn))) > process_insn_for_elimination (insn, final_p); >} > > The problem is in bitmap_bit_p. Random access to a large bitmap can be > very slow. > > I'm playing with a patch to expand the insns_with_changed_offsets > bitmap to an sbitmap, and will send a patch if this works better. Or make insns_with_changed_offsets a VEC of insns (or a pointer-set). Richard. > Ciao! > Steven
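Either suggestion avoids the random-access bitmap_bit_p lookups. A sketch of the sbitmap variant, assuming the sbitmap API of that time (sbitmap_alloc, sbitmap_zero, SET_BIT, TEST_BIT):

  /* Hypothetical sketch: flat bit array indexed by INSN_UID instead of
     a linked-chunk bitmap.  */
  sbitmap changed = sbitmap_alloc (get_max_uid ());
  sbitmap_zero (changed);
  /* ... SET_BIT (changed, INSN_UID (insn)) wherever an offset changes ... */

  FOR_EACH_BB (bb)
    FOR_BB_INSNS_SAFE (bb, insn, temp)
      if (TEST_BIT (changed, INSN_UID (insn)))
        process_insn_for_elimination (insn, final_p);

  sbitmap_free (changed);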
Re: [ping patch] Predict for loop exits in short-circuit conditions
On Mon, Oct 8, 2012 at 4:50 AM, Dehao Chen wrote: > Attached is the updated patch. Yes, if we add a VRP pass before > profile pass, this patch would be unnecessary. Should we add a VRP > pass? No, we don't want VRP in early optimizations. Richard. > Thanks, > Dehao > > On Sat, Oct 6, 2012 at 9:38 AM, Jan Hubicka wrote: >>> ping^2 >>> >>> Honza, do you think this patch can make into 4.8 stage 1? >> >> + if (check_value_one ^ integer_onep (val)) >> >> Probably better as != >> (especially because GNU coding standard allows predicates to return more than >> just boolean) >> >> >> +{ >> + edge e1; >> + edge_iterator ei; >> + tree val = gimple_phi_arg_def (phi_stmt, i); >> + edge e = gimple_phi_arg_edge (phi_stmt, i); >> + >> + if (!TREE_CONSTANT (val) || !(integer_zerop (val) || integer_onep >> (val))) >> + continue; >> + if (check_value_one ^ integer_onep (val)) >> + continue; >> + if (VEC_length (edge, e->src->succs) != 1) >> + { >> + if (!predicted_by_p (exit_edge->src, PRED_LOOP_ITERATIONS_GUESSED) >> + && !predicted_by_p (exit_edge->src, PRED_LOOP_ITERATIONS) >> + && !predicted_by_p (exit_edge->src, PRED_LOOP_EXIT)) >> + predict_edge_def (e, PRED_LOOP_EXIT, NOT_TAKEN); >> + continue; >> + } >> + >> + FOR_EACH_EDGE (e1, ei, e->src->preds) >> + if (!predicted_by_p (exit_edge->src, PRED_LOOP_ITERATIONS_GUESSED) >> + && !predicted_by_p (exit_edge->src, PRED_LOOP_ITERATIONS) >> + && !predicted_by_p (exit_edge->src, PRED_LOOP_EXIT)) >> + predict_edge_def (e1, PRED_LOOP_EXIT, NOT_TAKEN); >> >> Here you found an edge that you know is going to terminate the loop >> and you want to predict all paths to this edge as unlikely. >> Perhaps you want to use predict paths leading_to_edge for edge? >> >> You do not need to check PRED_LOOP_ITERATIONS and >> PRED_LOOP_ITERATIONS_GUESSED >> because those never go to the non-exit edges. >> >> The nature of predict_paths_for_bb type heuristic is that they are not really >> additive: if the path leads to two different aborts it does not make it more >> sure that it will be unlikely. So perhaps you can add !predicted_by_p (e, >> pred) >> prior predict_edge_def call in the function? >> >> I wonder if we did VRP just before branch predction to jump thread the >> shortcut >> condtions into loopback edges, would be there still cases where this >> optimization will match? >> >> Honza
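The helper referred to here is predict_paths_leading_to_edge in predict.c; using it, the open-coded predecessor walk in the patch would presumably collapse to something like this sketch:

  /* Hypothetical sketch: predict all paths that lead to the
     loop-exiting edge as unlikely, rather than walking e->src->preds
     by hand.  */
  if (!predicted_by_p (exit_edge->src, PRED_LOOP_ITERATIONS_GUESSED)
      && !predicted_by_p (exit_edge->src, PRED_LOOP_ITERATIONS)
      && !predicted_by_p (exit_edge->src, PRED_LOOP_EXIT))
    predict_paths_leading_to_edge (e, PRED_LOOP_EXIT, NOT_TAKEN);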
Re: [PATCH] Fix PR54489 - FRE needing AVAIL_OUT
On Fri, 5 Oct 2012, Steven Bosscher wrote: > On Fri, Sep 14, 2012 at 2:26 PM, Richard Guenther wrote: > > If you can figure out a better name for the function we should > > probably move it to cfganal.c > > It looks like my previous e-mail about this appears to have got lost > somehow, so retry: > > Your my_rev_post_order_compute is simply inverted_post_order_compute. > The only difference is that you'll want to ignore EXIT_BLOCK, which is > always added to the list by inverted_post_order_compute. Indeed. inverted_post_order_compute seems to handle a CFG without infinite-loop and noreturns connected to exit though. Possibly that's why it doesn't care for not handling entry/exit. I'm testing a patch to use inverted_post_order_compute from PRE. Richard.
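A sketch of what reusing the existing helper in PRE might look like, skipping the ENTRY/EXIT indexes that inverted_post_order_compute can include (hypothetical; the committed patch may differ):

  int *postorder = XNEWVEC (int, n_basic_blocks);
  int postorder_num = inverted_post_order_compute (postorder);
  int i;

  for (i = 0; i < postorder_num; i++)
    {
      basic_block bb;
      if (postorder[i] == ENTRY_BLOCK || postorder[i] == EXIT_BLOCK)
        continue;
      bb = BASIC_BLOCK (postorder[i]);
      /* ... process BB in reverse post order of the inverted CFG ... */
    }

  free (postorder);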
Re: vec_cond_expr adjustments
On Fri, Oct 5, 2012 at 5:01 PM, Marc Glisse wrote: > [I am still a little confused, sorry for the long email...] > > > On Tue, 2 Oct 2012, Richard Guenther wrote: > >>>>> + if (TREE_CODE (op0) == VECTOR_CST && TREE_CODE (op1) == VECTOR_CST) >>>>> +{ >>>>> + int count = VECTOR_CST_NELTS (op0); >>>>> + tree *elts = XALLOCAVEC (tree, count); >>>>> + gcc_assert (TREE_CODE (type) == VECTOR_TYPE); >>>>> + >>>>> + for (int i = 0; i < count; i++) >>>>> + { >>>>> + tree elem_type = TREE_TYPE (type); >>>>> + tree elem0 = VECTOR_CST_ELT (op0, i); >>>>> + tree elem1 = VECTOR_CST_ELT (op1, i); >>>>> + >>>>> + elts[i] = fold_relational_const (code, elem_type, >>>>> + elem0, elem1); >>>>> + >>>>> + if(elts[i] == NULL_TREE) >>>>> + return NULL_TREE; >>>>> + >>>>> + elts[i] = fold_negate_const (elts[i], elem_type); >>>> >>>> >>>> >>>> I think you need to invent something new similar to STORE_FLAG_VALUE >>>> or use STORE_FLAG_VALUE here. With the above you try to map >>>> {0, 1} to {0, -1} which is only true if the operation on the element >>>> types >>>> returns {0, 1} (thus, STORE_FLAG_VALUE is 1). >>> >>> >>> Er, seems to me that constant folding of a scalar comparison in the >>> front/middle-end only returns {0, 1}. > > [and later] > >> I'd say adjust your fold-const patch to not negate the scalar result >> but build a proper -1 / 0 value based on integer_zerop(). > > > I don't mind doing it that way, but I would like to understand first. > LT_EXPR on scalars is guaranteed (in generic.texi) to be 0 or 1. So negating > should be the same as testing with integer_zerop to build -1 or 0. Is it > just a matter of style (then I am ok), or am I missing a reason which makes > the negation wrong? Just a matter of style. Negating is a lot less descriptive for the actual set of return values we produce. >> The point is we need to define some semantics for vector comparison >> results. > > > Yes. I think a documentation patch should come first: generic.texi is > missing an entry for VEC_COND_EXPR and the entry for LT_EXPR doesn't mention > vectors. But before that we need to decide what to put there... > > >> One variant is to make it target independent which in turn >> would inhibit (or make it more difficult) to exploit some target features. >> You for example use {0, -1} for truth values - probably to exploit target >> features - > > > Actually it was mostly because that is the meaning in the language. OpenCL > says that a the elements in the condition. The fact that it matches what some targets do > is a simple consequence of the fact that OpenCL was based on what hardware > already did. Yes, it seems that the {0, -1} choice is most reasonable for GENERIC. So let's document that. > >> even though the most natural middle-end way would be to >> use {0, 1} as for everything else > > > I agree that it would be natural and convenient in a number of places. > > >> (caveat: there may be both signed and unsigned bools, we don't allow >> vector components with non-mode precision, thus you could argue that a >> signed bool : 1 is just "sign-extended" for your solution). > > > Not sure how that would translate in the code. > > >> A different variant is to make it target dependent to leverage >> optimization opportunities > > > That's an interesting possibility... > > >> that's why STORE_FLAG_VALUE exists. > > > AFAICS it only appears when we go from gimple to rtl, not before (and there > is already a VECTOR_STORE_FLAG_VALUE, although no target defines it). 
Which > doesn't mean we couldn't make it appear earlier for vectors. > > >> For example with vector comparisons a < v result, when >> performing bitwise operations on it, you either have to make the target >> expand code to produce {0, -1} even if the natural compare instruction >> would, say, produce {0, 0x8} - or not constrain the possible values >> of its result (like forwprop would do with your patch). In general we >> want constant folding to yield the same results as if the HW carried >> out the operation to make -O0 code not diverge from -O1. Thus, >> >> v4si g
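Concretely, the fold-const hunk quoted earlier in this thread would then build the element values directly from the comparison result instead of negating, along these lines (a sketch using the variables from the quoted hunk):

  /* Sketch: fold one element of a vector comparison to the documented
     0 / -1 encoding based on integer_zerop.  */
  tree tmp = fold_relational_const (code, elem_type, elem0, elem1);
  if (tmp == NULL_TREE)
    return NULL_TREE;
  elts[i] = integer_zerop (tmp)
            ? build_int_cst (elem_type, 0)
            : build_int_cst (elem_type, -1);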
Re: patch to fix constant math - second small patch
On Sat, Oct 6, 2012 at 12:48 AM, Kenneth Zadeck wrote: > This patch adds machinery to genmodes.c so that the largest possible sizes of > various data structures can be determined at gcc build time. These > functions create 3 symbols that are available in insn-modes.h: > MAX_BITSIZE_MODE_INT - the bitsize of the largest int. > MAX_BITSIZE_MODE_PARTIAL_INT - the bitsize of the largest partial int. > MAX_BITSIZE_MODE_ANY_INT - the largest bitsize of any kind of int. Ok. Please document these macros in rtl.texi. Richard.
Re: patch to fix constant math - third small patch
On Sat, Oct 6, 2012 at 5:55 PM, Kenneth Zadeck wrote: > This is the third patch in the series of patches to fix constant math. > this one changes some predicates at the rtl level to use the new predicate > CONST_SCALAR_INT_P. > I did not include a few that were tightly intertwined with other changes. > > Not all of these changes are strictly mechanical. Richard, when reviewing > this had me make additional changes to remove what he thought were latent > bugs at the rtl level. However, it appears that the bugs were not latent. > I do not know what is going on here but i am smart enough to not look a gift > horse in the mouth. > > All of this was done on the same machine with no changes and identical > configs. It is an x86-64 with ubuntu 12-4. > > ok for commit? Patch missing, but if it's just mechanical changes and introduction of CONST_SCALAR_INT_P consider it pre-approved. Richard. > in the logs below, gbBaseline is a trunk from friday and the gbWide is the > same revision but with my patches. Some of this like gfortran.dg/pr32627 is > obviously flutter, but the rest does not appear to be. > > = > heracles:~/gcc(13) gccBaseline/contrib/compare_tests > gbBaseline/gcc/testsuite/gcc/gcc.log gbWide/gcc/testsuite/gcc/gcc.log > New tests that PASS: > > gcc.dg/builtins-85.c scan-assembler mysnprintf > gcc.dg/builtins-85.c scan-assembler-not __chk_fail > gcc.dg/builtins-85.c (test for excess errors) > > > heracles:~/gcc(14) gccBaseline/contrib/compare_tests > gbBaseline/gcc/testsuite/gfortran/gfortran.log > gbWide/gcc/testsuite/gfortran/gfortran.log > New tests that PASS: > > gfortran.dg/pr32627.f03 -O3 -fomit-frame-pointer -funroll-loops (test for > excess errors) > gfortran.dg/pr32627.f03 -O3 -fomit-frame-pointer (test for excess errors) > gfortran.dg/pr32627.f03 -Os (test for excess errors) > gfortran.dg/pr32635.f -O0 execution test > gfortran.dg/pr32635.f -O0 (test for excess errors) > gfortran.dg/substr_6.f90 -O2 (test for excess errors) > > Old tests that passed, that have disappeared: (Eeek!) > > gfortran.dg/pr32627.f03 -O1 (test for excess errors) > gfortran.dg/pr32627.f03 -O3 -fomit-frame-pointer -funroll-all-loops > -finline-functions (test for excess errors) > gfortran.dg/pr32627.f03 -O3 -g (test for excess errors) > gfortran.dg/substring_equivalence.f90 -O (test for excess errors) > Using /home/zadeck/gcc/gccBaseline/gcc/testsuite/config/default.exp as > tool-and-target-specific interface file. > > === g++ Summary === > > # of expected passes49793 > # of expected failures284 > # of unsupported tests601 > > runtest completed at Fri Oct 5 16:10:20 2012 > heracles:~/gcc(16) tail gbWide/gcc/testsuite/g++/g++.log Using > /usr/share/dejagnu/config/unix.exp as generic interface file for target. > Using /home/zadeck/gcc/gccWide/gcc/testsuite/config/default.exp as > tool-and-target-specific interface file. > > === g++ Summary === > > # of expected passes50472 > # of expected failures284 > # of unsupported tests613 > > runtest completed at Fri Oct 5 19:51:50 2012 > > > > >
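For context, the new predicate amounts to accepting a scalar integer constant whichever rtx code carries it; roughly the following sketch, where the exact definition in the patch differs between converted and unconverted targets (CONST_WIDE_INT only exists on the former):

  /* Hypothetical sketch of what CONST_SCALAR_INT_P boils down to.  */
  #define CONST_SCALAR_INT_P_SKETCH(X) \
    (CONST_INT_P (X) || CONST_DOUBLE_AS_INT_P (X) || CONST_WIDE_INT_P (X))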
Re: Check that unlinked uses do not contain ssa-names when renaming.
On Sun, Oct 7, 2012 at 12:44 PM, Tom de Vries wrote: > Richard, > > attached patch checks that unlinked uses do not contain ssa-names when > renaming. > > This assert triggers when compiling (without the fix) the PR54735 example. > > AFAIU, it was due to chance that we caught the PR54735 bug by hitting the > verification failure, because the new vdef introduced by renaming happened to > be > the same name as the ssa name referenced in the invalid unlinked use (in terms > of maybe_replace_use: rdef == use). > > The assert from this patch catches all cases that an unlinked use contains an > ssa-name. > > Bootstrapped and reg-tested on x86_64 (Ada inclusive). > > OK for trunk? I don't think that is exactly what we should assert here ... (I thought about adding checking myself ...). What we'd want to assert is that before any new DEF is registered (which may re-allocate an SSA name) that no uses with SSA_NAME_IN_FREELIST appear. Thus, a light verification pass would be necessary at the beginning of update_ssa (which I queued onto my TODO list ...). We'd want that anyway to for example catch the case where a non-virtual operand is partially renamed. Thanks, Richard. > Thanks, > - Tom > > 2012-10-07 Tom de Vries > > * tree-into-ssa.c (maybe_replace_use): Add assert.
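The lighter verification described above would presumably be a per-statement walk at the start of update_ssa, along these lines (a sketch; the macro is spelled SSA_NAME_IN_FREE_LIST in tree.h):

  /* Hypothetical sketch: assert that no use operand refers to an SSA
     name that has already been released to the free list.  */
  static void
  verify_no_released_uses (gimple stmt)
  {
    ssa_op_iter iter;
    use_operand_p use_p;

    FOR_EACH_SSA_USE_OPERAND (use_p, stmt, iter, SSA_OP_ALL_USES)
      gcc_assert (!SSA_NAME_IN_FREE_LIST (USE_FROM_PTR (use_p)));
  }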
Re: patch to fix constant math
On Sun, Oct 7, 2012 at 4:58 PM, Kenneth Zadeck wrote: > > On 10/07/2012 09:19 AM, Richard Guenther wrote: >>> >>> >In fact, you could argue that the tree level did it wrong (not that i am >>> >suggesting to change this). But it makes me think what was going on >>> > when >>> >the decision to make TYPE_PRECISION be an INT_CST rather than just a HWI >>> > was >>> >made. For that there is an implication that it could never take more >>> > than >>> >a HWI since no place in the code even checks TREE_INT_CST_HIGH for >>> > these. >> >> Well - on the tree level we now always have two HWIs for all INTEGER_CSTs. >> If >> we can, based on the size of the underlying mode, reduce that to one >> HWI we already >> win something. If we add an explicit length to allow a smaller >> encoding for larger modes >> (tree_base conveniently has an available 'int' for this ...) then we'd >> win in more cases. >> Thus, is CONST_INT really necessarily better than optimized CONST_WIDE_INT >> storage? > > i have to admit, that looking at these data structures gives me a headache. > This all looks like something that Rube Goldberg would have invented had he > done object oriented design (Richard S did not know who Rube Goldberg when > i mentioned this name to him a few years ago since this is an American > thing, but the british had their own equivalent and I assume the germans do > too.). > > i did the first cut of changing the rtl level structure and Richard S threw > up on it and suggested what is there now, which happily (for me) i was able > to get mike to implement. > > mike also did the tree level version of the data structures for me. i will > make sure he used that left over length field. > > The bottom line is that you most likely just save the length, but that is a > big percent of something this small. Both of rtl ints have a mode, so if we > can make that change later, it will be space neutral. Yes. Btw, as for Richards idea of conditionally placing the length field in rtx_def looks like overkill to me. These days we'd merely want to optimize for 64bit hosts, thus unconditionally adding a 32 bit field to rtx_def looks ok to me (you can wrap that inside a union to allow both descriptive names and eventual different use - see what I've done to tree_base) Richard.
Re: [ping patch] Predict for loop exits in short-circuit conditions
On Mon, Oct 8, 2012 at 11:04 AM, Jan Hubicka wrote: >> On Mon, Oct 8, 2012 at 4:50 AM, Dehao Chen wrote: >> > Attached is the updated patch. Yes, if we add a VRP pass before >> > profile pass, this patch would be unnecessary. Should we add a VRP >> > pass? >> >> No, we don't want VRP in early optimizations. > > I am not quite sure about that. VRP > 1) makes branch prediction work better by doing jump threading early Well ... but jump threading may need basic-block duplication which may increase code size. Also VRP and FRE have pass ordering issues. > 2) is, after FRE, most effective tree pass on removing code by my profile > statistics. We also don't have DSE in early opts. I don't want to end up with the situation that we do everything in early opts ... we should do _less_ there (but eventually iterate properly when processing cycles). Richard. > But that would require more analysis. > The patch is OK. > Honza
Re: Fixup INTEGER_CST
On Mon, Oct 8, 2012 at 11:18 AM, Jan Hubicka wrote: >> On Sun, Oct 7, 2012 at 7:22 PM, Jan Hubicka wrote: >> >> On Sun, Oct 7, 2012 at 5:15 PM, Jan Hubicka wrote: >> >> > Hi, >> >> > I added a santy check that after fixup all types that lost in the >> >> > merging are >> >> > really dead. And it turns out we have some zombies around. >> >> > >> >> > INTEGER_CST needs special care because it is special cased by the >> >> > streamer. We also >> >> > do not want to do inplace modificaitons on it because that would >> >> > corrupt the hashtable >> >> > used by tree.c's sharing code >> >> > >> >> > Bootstrapped/regtested x86_64-linux, OK? >> >> >> >> No, I don't think we want to fixup INTEGER_CSTs this way. Instead we >> >> want to fixup >> >> them where they end up used unfixed. >> > >> > Erm, I think it is what the patch does? >> >> Ah, indeed. >> >> > It replaces pointers to integer_cst with type that did not survive by >> > pointer >> > to new integer_cst. (with the optimization that INTEGER_CST with overflow >> > is changed in place because it is allowed to do so). >> >> Btw ... >> >> >> > @@ -1526,6 +1549,11 @@ lto_ft_type (tree t) >> >> >LTO_FIXUP_TREE (t->type_non_common.binfo); >> >> > >> >> >LTO_FIXUP_TREE (TYPE_CONTEXT (t)); >> >> > + >> >> > + if (TREE_CODE (t) == METHOD_TYPE) >> >> > +TYPE_METHOD_BASETYPE (t); >> >> > + if (TREE_CODE (t) == OFFSET_TYPE) >> >> > +TYPE_OFFSET_BASETYPE (t); >> >> that looks like a no-op to me ... (both are TYPE_MAXVAL which >> is already fixed up). > > Ah, indeed. They were result of experimenting with the stale pointers to the > obsoletted types and field decls. I now understand where they come from. The > reason is twofold. > > 1) after merging records we replace field decls in the cache > by new ones. This however does not mean that they die, because > the existing pointers to them are not replaced. > I have WIP patch for that that however require one extra pass > over the list of all trees. Yes, I think this is also why we do /* ??? Not sure the above is all relevant in this path canonicalizing TYPE_FIELDS to that of the main variant. */ if (ix < i) lto_fixup_types (f2); streamer_tree_cache_insert_at (cache, f1, ix); something I dislike as well and something we should try to address in a more formal way. > 2) As we query the type_hash while we are rewritting the types, > we run into instability of the hashtable. This manifests itself > as an ICE when one adds sanity check that while merging function > types their arg types are equivalent, too. > This ICEs compiling i.e. sqlite but I did not really managed to > reduce this. I tracked it down to the argument type being inserted > into gimple_type_hash but at the time we query the new argument type, > the original is no longer found despite their hashes are equivalent. > The problem is hidden when things fit into the leader cache, > so one needs rather big testcase. Ugh. For reduction you can disable those caches though. The above means there is a disconnect between hashing and comparing. Maybe it's something weird with the early out if (TYPE_ARG_TYPES (t1) == TYPE_ARG_TYPES (t2)) goto same_types; ? > So I tried to register all gimple types first. Use TREE_VISITED within > the merging code to mark that type is not a leader and then TREE_CHAIN > to point to the leader. This avoids need to re-query the hashtable > from the later fixups. We only look for types with TREEE_VISITED > and replace them by TREE_CHAIN. TREE_CHAIN is unused for types? But we probably shouldn't add a new use ... 
> This has two passes. First we compute the main variants and mark > field_decls and type_decls for merging and in last pass we finally do > fixup on what remained in the table. > > This allows me to poison pointers in the removed types in a way > so the GGC would ICE if they stayed reachable. > I however need the extra pass because > 1) I can not update the type_decls/field_decls while registering > types or I run into the hash table problems > 2) I can not merge the second two passes because at the time > I find type/field decls equivalent there may be earlier pointers > to them. You need to "merge" all trees reachable from the one you start at once (what I'm working on from time to time - work per tree "SCC", in a DFS walk). Richard. > Honza
Re: vec_cond_expr adjustments
On Mon, Oct 8, 2012 at 11:34 AM, Marc Glisse wrote: > On Mon, 8 Oct 2012, Richard Guenther wrote: > >>> VEC_COND_EXPR is more complicated. We could for instance require that it >>> takes as first argument a vector of -1 and 0 (thus <0, !=0 and the neon >>> thing are equivalent). Which would leave to decide what the expansion of >>> vec_cond_expr passes to the targets when the first argument is not a >>> comparison, between !=0, <0, ==-1 or others (I vote for <0 because of >>> opencl). One issue is that targets wouldn't know if it was a dummy >>> comparison that can safely be ignored because the other part is the >>> result >>> of logical operations on comparisons (thus composed of -1 and 0) or a >>> genuine comparison with an arbitrary vector, so a new optimization would >>> be >>> needed (in the back-end I guess or we would need an alternate instruction >>> to >>> vcond) to detect if a vector is a "signed boolean" vector. >>> We could instead say that vec_cond_expr really follows OpenCL's semantics >>> and looks at the MSB of each element. I am not sure that would change >>> much, >>> it would mostly delay the apparition of <0 to RTL expansion time (and >>> thus >>> make gimple slightly lighter). >> >> >> I think we should delay the decision on how to optimize this. It's indeed >> not trivial and the GIMPLE middle-end aggressively forwards feeding >> comparisons into the VEC_COND_EXPR expressions already (somewhat >> defeating any CSE that might be possible here) in forwprop. > > > Thanks for going through the long email :-) > > What does that imply for the first argument of VEC_COND_EXPR? Currently, the > expander asserts that it is a comparison, but that is not reflected in the > gimple checkers. And I don't think we should reflect that in the gimple checkers rather fixup the expander (transparently use p != 0 or p < 0). > If we document that VEC_COND_EXPR takes a vector of -1 and 0 (which is the > case for a comparison), I don't think it prevents from later relaxing that > to <0 or !=0. But then I don't know how to handle expansion when the > argument is neither a comparison (vcond) nor a constant (vec_merge? I > haven't tried but that should be doable), I would have to pass <0 or !=0 to > the target. Yes. > So is the best choice to document that VEC_COND_EXPR takes as > first argument a comparison and make gimple checking reflect that? (seems > sad, but at least that would tell me what I can/can't do) No, that would just mean that in GIMPLE you'd add this p != 0 or p < 0. And at some point in the future I really really want to push this embedded expression to a separate statement so I have a SSA definition for it. > By the way, since we are documenting comparisons as returning 0 and -1, does > that bring back the integer_truep predicate? Not sure, true would still be != 0 or all_onesp (all bits of the precision are 1), no? Richard. > -- > Marc Glisse
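On the expander side, the fixup described here would decompose a non-comparison condition into an explicit comparison, roughly as in this sketch (tcode, op0a and op0b are hypothetical locals; the <0 form matches the {0, -1} encoding discussed above, != 0 would work as well):

  /* Hypothetical sketch for expand_vec_cond_expr: decompose the
     condition, treating a non-comparison operand as "op0 < 0".  */
  enum tree_code tcode;
  tree op0a, op0b;

  if (!COMPARISON_CLASS_P (op0))
    {
      tcode = LT_EXPR;
      op0a = op0;
      op0b = build_zero_cst (TREE_TYPE (op0));
    }
  else
    {
      tcode = TREE_CODE (op0);
      op0a = TREE_OPERAND (op0, 0);
      op0b = TREE_OPERAND (op0, 1);
    }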
Re: [RFC] Make vectorizer to skip loops with small iteration estimate
On Sat, Oct 6, 2012 at 11:34 AM, Jan Hubicka wrote: > Hi, > I benchmarked the patch moving loop header copying and it is quite noticeable > win. > > Some testsuite updating is needed. In many cases it is just because the > optimizations are now happening earlier. > There are however few testusite failures I have torubles to deal with: > ./testsuite/gcc/gcc.sum:FAIL: gcc.dg/tree-ssa/pr21559.c scan-tree-dump-times > vrp1 "Threaded jump" 3 > ./testsuite/gcc/gcc.sum:FAIL: gcc.dg/tree-ssa/ssa-dom-thread-2.c > scan-tree-dump-times vrp1 "Jumps threaded: 1" 1 > ./testsuite/gcc/gcc.sum:FAIL: gcc.dg/vect/O3-slp-reduc-10.c > scan-tree-dump-times vect "vectorized 1 loops" 2 > ./testsuite/g++/g++.sum:FAIL: g++.dg/tree-ssa/pr18178.C -std=gnu++98 > scan-tree-dump-times vrp1 "if " 1 > ./testsuite/g++/g++.sum:FAIL: g++.dg/tree-ssa/pr18178.C -std=gnu++11 > scan-tree-dump-times vrp1 "if " 1 > > This is mostly about VRP losing its ability to thread some jumps from the > duplicated loop header out of the loop across the loopback edge. This seems > to > be due to loop updating logic. Do we care about these? Yes, I think so. At least we care that the optimized result is the same. Can you elaborate on "due to loop updating logic"? Can you elaborate on the def_split_header_continue_p change? Which probably should be tested and installed separately? Thanks, Richard. > Honza > > Index: tree-ssa-threadupdate.c > === > *** tree-ssa-threadupdate.c (revision 192123) > --- tree-ssa-threadupdate.c (working copy) > *** static bool > *** 846,854 > def_split_header_continue_p (const_basic_block bb, const void *data) > { > const_basic_block new_header = (const_basic_block) data; > ! return (bb != new_header > ! && (loop_depth (bb->loop_father) > ! >= loop_depth (new_header->loop_father))); > } > > /* Thread jumps through the header of LOOP. Returns true if cfg changes. > --- 846,860 > def_split_header_continue_p (const_basic_block bb, const void *data) > { > const_basic_block new_header = (const_basic_block) data; > ! const struct loop *l; > ! > ! if (bb == new_header > ! || loop_depth (bb->loop_father) < loop_depth > (new_header->loop_father)) > ! return false; > ! for (l = bb->loop_father; l; l = loop_outer (l)) > ! if (l == new_header->loop_father) > ! return true; > ! return false; > } > > /* Thread jumps through the header of LOOP. Returns true if cfg changes. > Index: testsuite/gcc.dg/unroll_2.c > === > *** testsuite/gcc.dg/unroll_2.c (revision 192123) > --- testsuite/gcc.dg/unroll_2.c (working copy) > *** > *** 1,5 > /* { dg-do compile { target i?86-*-linux* x86_64-*-linux* } } */ > ! /* { dg-options "-O2 -fdump-rtl-loop2_unroll -fno-peel-loops > -fdisable-tree-cunroll=foo -fdisable-tree-cunrolli=foo > -fenable-rtl-loop2_unroll" } */ > > unsigned a[100], b[100]; > inline void bar() > --- 1,5 > /* { dg-do compile { target i?86-*-linux* x86_64-*-linux* } } */ > ! /* { dg-options "-O2 -fdump-rtl-loop2_unroll -fno-peel-loops > -fdisable-tree-cunroll=foo -fdisable-tree-cunrolli=foo > -fenable-rtl-loop2_unroll -fno-tree-dominator-opts" } */ > > unsigned a[100], b[100]; > inline void bar() > Index: testsuite/gcc.dg/unroll_3.c > === > *** testsuite/gcc.dg/unroll_3.c (revision 192123) > --- testsuite/gcc.dg/unroll_3.c (working copy) > *** > *** 1,5 > /* { dg-do compile { target i?86-*-linux* x86_64-*-linux* } } */ > ! 
/* { dg-options "-O2 -fdump-rtl-loop2_unroll -fno-peel-loops > -fdisable-tree-cunroll -fdisable-tree-cunrolli -fenable-rtl-loop2_unroll=foo" > } */ > > unsigned a[100], b[100]; > inline void bar() > --- 1,5 > /* { dg-do compile { target i?86-*-linux* x86_64-*-linux* } } */ > ! /* { dg-options "-O2 -fdump-rtl-loop2_unroll -fno-peel-loops > -fdisable-tree-cunroll -fdisable-tree-cunrolli -fenable-rtl-loop2_unroll=foo > -fno-tree-dominator-opts" } */ > > unsigned a[100], b[100]; > inline void bar() > Index: testsuite/gcc.dg/torture/pr23821.c > === > *** testsuite/gcc.dg/torture/pr23821.c (revision 192123) > --- testsuite/gcc.dg/torture/pr23821.c (working copy) > *** > *** 1,9 > /* { dg-do compile } */ > /* { dg-skip-if "" { *-*-* } { "-O0" "-fno-fat-lto-objects" } { "" } } */ > ! /* At -O1 DOM threads a jump in a non-optimal way which leads to > the bogus propagation. */ > ! /* { dg-skip-if "" { *-*-* } { "-O1" } { "" } } */ > ! /* { dg-options "-fdump-tree-ivcanon-details" } */ > > int a[199]; > > --- 1,8 > /* { dg-do compile } */ > /* { dg-skip-if "" { *-*-* } { "-O0" "-fno-fat-lto-objects" } { "" } } */ > ! /* DOM threads a jump in a non-o
Re: [ping patch] Predict for loop exits in short-circuit conditions
On Mon, Oct 8, 2012 at 12:01 PM, Jan Hubicka wrote: >> On Mon, Oct 8, 2012 at 11:04 AM, Jan Hubicka wrote: >> >> On Mon, Oct 8, 2012 at 4:50 AM, Dehao Chen wrote: >> >> > Attached is the updated patch. Yes, if we add a VRP pass before >> >> > profile pass, this patch would be unnecessary. Should we add a VRP >> >> > pass? >> >> >> >> No, we don't want VRP in early optimizations. >> > >> > I am not quite sure about that. VRP >> > 1) makes branch prediction work better by doing jump threading early >> >> Well ... but jump threading may need basic-block duplication which may >> increase code size. Also VRP and FRE have pass ordering issues. >> >> > 2) is, after FRE, most effective tree pass on removing code by my profile >> > statistics. >> >> We also don't have DSE in early opts. I don't want to end up with the >> situation that we do everything in early opts ... we should do _less_ there >> (but eventually iterate properly when processing cycles). > > Yep, i am not quite sure about most sane variant. Missed simple jump > threading > in early opts definitely confuse both profile estimate and inline size > estimates. But I am also not thrilled by adding more passes to early opts at > all. Also last time I looked into this, CCP missed a lot of CCP oppurtunities > making VRP to artifically look like more useful. Eh .. that shouldn't happen. Do you have testcases by any chance? I used to duplicate each SSA propagator pass and checked -fdump-statistics-stats for that the 2nd pass does nothing (thus chaining CCP doesn't improve results). But maybe that's not the issue you run into here? > Have patch that bit improves profile updating after jump threading (i.e. > re-does the profile for simple cases), but still jump threading is the most > common case for profile become inconsistent after expand. > > On a related note, with -fprofile-report I can easilly track how much of code > each pass in the queue removed. I was thinking about running this on Mozilla > and -O1 and removing those passes that did almost nothing. Those are mostly > re-run passes, both at Gimple and RTL level. Our passmanager is not terribly > friendly for controlling pass per-repetition. Sure. You can also more thorougly instrument passes and use -fdump-statistics for that (I've done that), but we usually have testcases that require that each pass that still is there is present ... > With introduction of -Og pass queue, do you think introducing -O1 pass queue > for late tree passes (that will be quite short) is sane? Yes. I don't like the dump-file naming mess that results though, but if we want to support optimized attribute switching between -O1 and -O2 then I guess we have to live with that ... Originally I wanted to base -Og on -O1 (thus have them mostly share the pass queue) and retain the same pass queue for -O2 and -Os. Maybe that's what we eventually want to do. Thus, add a (off for -Og) loop optimizer sub-pass to the queue and schedule some scalar cleanups after it but inside it. > What about RTL > level? I guess we can split the queues for RTL optimizations, too. > All optimizations passes prior register allocation are sort of optional > and I guess there are also -Og candidates. Yes. Though I first wanted to see actual issues with the RTL optimizers and -Og. > I hoever find the 3 times duplicated queues bit uncool, too, but I guess > it is most compatible with PM organization. Indeed ;) We should at least try to share the queues for -Og and -O1. 
> At -O3 the most effective passes on combine.c > are: > > cfg (because of cfg cleanup) -1.5474% > Early inlining -0.4991% > FRE -7.9369% > VRP -0.9321% (if run early), ccp does -0.2273% I think VRP has the advantage of taking loop iteration counts into account. Maybe we can add sth similar to CCP. It's sad that VRP is too expensive, it really is a form of CCP so merging both passes would be best (we can at a single point, add_equivalence, turn off equivalence processing - the most expensive part of VRP, and call that CCP ...). > tailr -0.5305% > > After IPA > copyrename -2.2850% (it packs cleanups after inlining) > forwprop -0.5432% > VRP -0.9700% (if rerun after early passes, otherwise it is about 2%) > PRE -2.4123% > DOM -0.5182% > > RTL passes > into_cfglayout -3.1400% (i.e. first cleanup_cfg) > fwprop1 -3.0467% > cprop -2.7786% > combine -3.3346% > IRA -3.4912% (i.e. the cost model prefers hard regs) > bbro -0.9765% > > The numbers on tramp3d and the LTO cc1 binary are not that different. Yes. Richard. > Honza
Re: [Patch] Fix PR53397
On Mon, Oct 8, 2012 at 12:01 PM, Kumar, Venkataramanan wrote: > Hi Richard, > > I have incorporated your comments. > >> Yes, call dump_mem_ref then, instead of repeating parts of its body. > > Reference object is not yet created at the place we check for invariance. It > is still a tree expression. I created a common function and used at all > places to dump the "step", "base" and "delta" values of memory reference > being analyzed. > > Please find the modified patch attached. > > GCC regression "make check -k" passes with x86_64-unknown-linux-gnu. I presume also bootstrapped. Ok. Thanks, Richard. > Regards, > Venkat. > > -Original Message- > From: Richard Guenther [mailto:richard.guent...@gmail.com] > Sent: Thursday, October 04, 2012 6:26 PM > To: Kumar, Venkataramanan > Cc: Richard Guenther; gcc-patches@gcc.gnu.org > Subject: Re: [Patch] Fix PR53397 > > On Tue, Oct 2, 2012 at 6:40 PM, Kumar, Venkataramanan > wrote: >> Hi Richi, >> >> (Snip) >>> + (!cst_and_fits_in_hwi (step)) >>> +{ >>> + if( loop->inner != NULL) >>> +{ >>> + if (dump_file && (dump_flags & TDF_DETAILS)) >>> +{ >>> + fprintf (dump_file, "Reference %p:\n", (void *) ref); >>> + fprintf (dump_file, "(base " ); >>> + print_generic_expr (dump_file, base, TDF_SLIM); >>> + fprintf (dump_file, ", step "); >>> + print_generic_expr (dump_file, step, TDF_TREE); >>> + fprintf (dump_file, ")\n"); >> >> No need to repeat this - all references are dumped when we gather them. >> (Snip) >> >> The dumping happens at "record_ref" which is called after these statements >> to record these references. >> >> When the step is invariant we return from the function without recording >> the references. >> >> so I thought of dumping the references here. >> >> Is there a cleaner way to dump the references at one place? > > Yes, call dump_mem_ref then, instead of repeating parts of its body. > > Richard. > >> Regards, >> Venkat. >> >> >> >> -Original Message- >> From: Richard Guenther [mailto:rguent...@suse.de] >> Sent: Tuesday, October 02, 2012 5:42 PM >> To: Kumar, Venkataramanan >> Cc: gcc-patches@gcc.gnu.org >> Subject: Re: [Patch] Fix PR53397 >> >> On Mon, 1 Oct 2012, venkataramanan.ku...@amd.com wrote: >> >>> Hi, >>> >>> The below patch fixes the FFT/Scimark regression caused by useless >>> prefetch generation. >>> >>> This fix tries to make prefetch less aggressive by prefetching arrays >>> in the inner loop, when the step is invariant in the entire loop nest. >>> >>> GCC currently tries to prefetch invariant steps when they are in the >>> inner loop. But does not check if the step is variant in outer loops. >>> >>> In the scimark FFT case, the trip count of the inner loop varies by a >>> non constant step, which is invariant in the inner loop. >>> But the step variable is varying in outer loop. This makes inner loop >>> trip count small (at run time varies sometimes as small as 1 >>> iteration) >>> >>> Prefetching ahead x iteration when the inner loop trip count is >>> smaller than x leads to useless prefetches. >>> >>> Flag used: -O3 -march=amdfam10 >>> >>> Before >>> ** ** >>> ** SciMark2 Numeric Benchmark, see http://math.nist.gov/scimark ** >>> ** for details. (Results can be submitted to p...@nist.gov) ** >>> ** ** >>> Using 2.00 seconds min time per kenel. 
>>> Composite Score: 550.50 >>> FFT Mflops:38.66(N=1024) >>> SOR Mflops: 617.61(100 x 100) >>> MonteCarlo: Mflops: 173.74 >>> Sparse matmult Mflops: 675.63(N=1000, nz=5000) >>> LU Mflops: 1246.88(M=100, N=100) >>> >>> >>> After >>> ** ** >>> ** SciMark2 Numeric Benchmark, see http://math.nist.gov/scimark ** >>> ** for details. (Results can be submitted to p...@nist.gov) ** >>> ** **
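For readers without SciMark at hand, the access pattern being discussed can be illustrated with a minimal sketch (purely illustrative C, not the FFT kernel; names and bounds are made up). The inner-loop step k is invariant inside the inner loop, so the prefetcher considers it, but k changes in the outer loop and the inner trip count can shrink to a handful of iterations, which is when prefetching several iterations ahead becomes useless.

/* Illustrative only -- not the SciMark FFT code.  The step k is
   invariant in the inner loop but varies across outer iterations,
   so the inner loop sometimes runs for only one or two iterations.  */
void
sketch (double *a, int n)
{
  for (int k = 1; k < n; k *= 2)        /* step doubles per outer iteration */
    for (int i = 0; i + k < n; i += k)  /* short trip count when k is large */
      a[i] = 0.5 * (a[i] + a[i + k]);   /* prefetching far ahead is wasted */
}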
[PATCH] Remove my_rev_post_order_compute
This replaces my_rev_post_order_compute in PRE by the already existing inverted_post_order_compute, with the necessary adjustments. Bootstrapped and tested on x86_64-unknown-linux-gnu, applied. Richard. 2012-10-08 Richard Guenther * tree-ssa-pre.c (postorder_num): New global. (compute_antic): Initialize all blocks and adjust for generic postorder. (my_rev_post_order_compute): Remove. (init_pre): Use inverted_post_order_compute. Index: gcc/tree-ssa-pre.c === --- gcc/tree-ssa-pre.c (revision 192119) +++ gcc/tree-ssa-pre.c (working copy) @@ -430,6 +430,7 @@ typedef struct bb_bitmap_sets /* Basic block list in postorder. */ static int *postorder; +static int postorder_num; /* This structure is used to keep track of statistics on what optimization PRE was able to perform. */ @@ -2456,7 +2457,7 @@ compute_antic (void) has_abnormal_preds = sbitmap_alloc (last_basic_block); sbitmap_zero (has_abnormal_preds); - FOR_EACH_BB (block) + FOR_ALL_BB (block) { edge_iterator ei; edge e; @@ -2480,9 +2481,7 @@ compute_antic (void) } /* At the exit block we anticipate nothing. */ - ANTIC_IN (EXIT_BLOCK_PTR) = bitmap_set_new (); BB_VISITED (EXIT_BLOCK_PTR) = 1; - PA_IN (EXIT_BLOCK_PTR) = bitmap_set_new (); changed_blocks = sbitmap_alloc (last_basic_block + 1); sbitmap_ones (changed_blocks); @@ -2496,7 +2495,7 @@ compute_antic (void) for PA ANTIC computation. */ num_iterations++; changed = false; - for (i = n_basic_blocks - NUM_FIXED_BLOCKS - 1; i >= 0; i--) + for (i = postorder_num - 1; i >= 0; i--) { if (TEST_BIT (changed_blocks, postorder[i])) { @@ -2525,7 +2524,7 @@ compute_antic (void) fprintf (dump_file, "Starting iteration %d\n", num_iterations); num_iterations++; changed = false; - for (i = n_basic_blocks - NUM_FIXED_BLOCKS - 1 ; i >= 0; i--) + for (i = postorder_num - 1 ; i >= 0; i--) { if (TEST_BIT (changed_blocks, postorder[i])) { @@ -4593,78 +4592,6 @@ remove_dead_inserted_code (void) BITMAP_FREE (worklist); } -/* Compute a reverse post-order in *POST_ORDER. If INCLUDE_ENTRY_EXIT is - true, then then ENTRY_BLOCK and EXIT_BLOCK are included. Returns - the number of visited blocks. */ - -static int -my_rev_post_order_compute (int *post_order, bool include_entry_exit) -{ - edge_iterator *stack; - int sp; - int post_order_num = 0; - sbitmap visited; - - if (include_entry_exit) -post_order[post_order_num++] = EXIT_BLOCK; - - /* Allocate stack for back-tracking up CFG. */ - stack = XNEWVEC (edge_iterator, n_basic_blocks + 1); - sp = 0; - - /* Allocate bitmap to track nodes that have been visited. */ - visited = sbitmap_alloc (last_basic_block); - - /* None of the nodes in the CFG have been visited yet. */ - sbitmap_zero (visited); - - /* Push the last edge on to the stack. */ - stack[sp++] = ei_start (EXIT_BLOCK_PTR->preds); - - while (sp) -{ - edge_iterator ei; - basic_block src; - basic_block dest; - - /* Look at the edge on the top of the stack. */ - ei = stack[sp - 1]; - src = ei_edge (ei)->src; - dest = ei_edge (ei)->dest; - - /* Check if the edge source has been visited yet. */ - if (src != ENTRY_BLOCK_PTR && ! TEST_BIT (visited, src->index)) -{ - /* Mark that we have visited the destination. */ - SET_BIT (visited, src->index); - - if (EDGE_COUNT (src->preds) > 0) -/* Since the SRC node has been visited for the first - time, check its predecessors. 
*/ -stack[sp++] = ei_start (src->preds); - else -post_order[post_order_num++] = src->index; -} - else -{ - if (ei_one_before_end_p (ei) && dest != EXIT_BLOCK_PTR) -post_order[post_order_num++] = dest->index; - - if (!ei_one_before_end_p (ei)) -ei_next (&stack[sp - 1]); - else -sp--; -} -} - - if (include_entry_exit) -post_order[post_order_num++] = ENTRY_BLOCK; - - free (stack); - sbitmap_free (visited); - return post_order_num; -} - /* Initialize data structures used by PRE. */ @@ -4686,9 +4613,8 @@ init_pre (void) connect_infinite_loops_to_exit (); memset (&pre_stats, 0, sizeof (pre_stats)); - - postorder = XNEWVEC (int, n_basic_blocks - NUM_FIXED_BLOCKS); - my_rev_post_order_compute (postorder, false); + postorder = XNEWVEC (int, n_basic_blocks); + postorder_num = inverted_post_order_compute (postorder); alloc_aux_for_blocks (sizeof (struct bb_bitmap_sets));
[PATCH] Fix PR54825
This fixes PR54825, properly FRE/PRE vector BIT_FIELD_REFs. Bootstrapped and tested on x86_64-unknown-linux-gnu, applied. Richard. 2012-10-08 Richard Guenther PR tree-optimization/54825 * tree-ssa-sccvn.c (vn_nary_length_from_stmt): Handle BIT_FIELD_REF. (init_vn_nary_op_from_stmt): Likewise. * tree-ssa-pre.c (compute_avail): Use vn_nary_op_lookup_stmt. * tree-ssa-sccvn.h (sizeof_vn_nary_op): Avoid overflow. Index: gcc/tree-ssa-sccvn.c === *** gcc/tree-ssa-sccvn.c(revision 192120) --- gcc/tree-ssa-sccvn.c(working copy) *** vn_nary_length_from_stmt (gimple stmt) *** 2194,2199 --- 2194,2202 case VIEW_CONVERT_EXPR: return 1; + case BIT_FIELD_REF: + return 3; + case CONSTRUCTOR: return CONSTRUCTOR_NELTS (gimple_assign_rhs1 (stmt)); *** init_vn_nary_op_from_stmt (vn_nary_op_t *** 2220,2225 --- 2223,2235 vno->op[0] = TREE_OPERAND (gimple_assign_rhs1 (stmt), 0); break; + case BIT_FIELD_REF: + vno->length = 3; + vno->op[0] = TREE_OPERAND (gimple_assign_rhs1 (stmt), 0); + vno->op[1] = TREE_OPERAND (gimple_assign_rhs1 (stmt), 1); + vno->op[2] = TREE_OPERAND (gimple_assign_rhs1 (stmt), 2); + break; + case CONSTRUCTOR: vno->length = CONSTRUCTOR_NELTS (gimple_assign_rhs1 (stmt)); for (i = 0; i < vno->length; ++i) *** init_vn_nary_op_from_stmt (vn_nary_op_t *** 2227,2232 --- 2237,2243 break; default: + gcc_checking_assert (!gimple_assign_single_p (stmt)); vno->length = gimple_num_ops (stmt) - 1; for (i = 0; i < vno->length; ++i) vno->op[i] = gimple_op (stmt, i + 1); Index: gcc/tree-ssa-pre.c === *** gcc/tree-ssa-pre.c (revision 192120) --- gcc/tree-ssa-pre.c (working copy) *** compute_avail (void) *** 3850,3860 || code == VEC_COND_EXPR) continue; ! vn_nary_op_lookup_pieces (gimple_num_ops (stmt) - 1, ! code, ! gimple_expr_type (stmt), ! gimple_assign_rhs1_ptr (stmt), ! &nary); if (!nary) continue; --- 3850,3856 || code == VEC_COND_EXPR) continue; ! vn_nary_op_lookup_stmt (stmt, &nary); if (!nary) continue; Index: gcc/tree-ssa-sccvn.h === *** gcc/tree-ssa-sccvn.h(revision 192120) --- gcc/tree-ssa-sccvn.h(working copy) *** typedef const struct vn_nary_op_s *const *** 51,57 static inline size_t sizeof_vn_nary_op (unsigned int length) { ! return sizeof (struct vn_nary_op_s) + sizeof (tree) * (length - 1); } /* Phi nodes in the hashtable consist of their non-VN_TOP phi --- 51,57 static inline size_t sizeof_vn_nary_op (unsigned int length) { ! return sizeof (struct vn_nary_op_s) + sizeof (tree) * length - sizeof (tree); } /* Phi nodes in the hashtable consist of their non-VN_TOP phi
Re: patch to fix constant math - third small patch
On Mon, Oct 8, 2012 at 1:36 PM, Kenneth Zadeck wrote: > yes, my bad. here it is with the patches. Just for the record, ok! Thanks, Richard. > On 10/06/2012 11:55 AM, Kenneth Zadeck wrote: >> >> This is the third patch in the series of patches to fix constant math. >> this one changes some predicates at the rtl level to use the new predicate >> CONST_SCALAR_INT_P. >> I did not include a few that were tightly intertwined with other changes. >> >> Not all of these changes are strictly mechanical. Richard, when >> reviewing this had me make additional changes to remove what he thought were >> latent bugs at the rtl level. However, it appears that the bugs were not >> latent.I do not know what is going on here but i am smart enough to not >> look a gift horse in the mouth. >> >> All of this was done on the same machine with no changes and identical >> configs. It is an x86-64 with ubuntu 12-4. >> >> ok for commit? >> >> in the logs below, gbBaseline is a trunk from friday and the gbWide is the >> same revision but with my patches. Some of this like gfortran.dg/pr32627 is >> obviously flutter, but the rest does not appear to be. >> >> = >> heracles:~/gcc(13) gccBaseline/contrib/compare_tests >> gbBaseline/gcc/testsuite/gcc/gcc.log gbWide/gcc/testsuite/gcc/gcc.log >> New tests that PASS: >> >> gcc.dg/builtins-85.c scan-assembler mysnprintf >> gcc.dg/builtins-85.c scan-assembler-not __chk_fail >> gcc.dg/builtins-85.c (test for excess errors) >> >> >> heracles:~/gcc(14) gccBaseline/contrib/compare_tests >> gbBaseline/gcc/testsuite/gfortran/gfortran.log >> gbWide/gcc/testsuite/gfortran/gfortran.log >> New tests that PASS: >> >> gfortran.dg/pr32627.f03 -O3 -fomit-frame-pointer -funroll-loops (test for >> excess errors) >> gfortran.dg/pr32627.f03 -O3 -fomit-frame-pointer (test for excess >> errors) >> gfortran.dg/pr32627.f03 -Os (test for excess errors) >> gfortran.dg/pr32635.f -O0 execution test >> gfortran.dg/pr32635.f -O0 (test for excess errors) >> gfortran.dg/substr_6.f90 -O2 (test for excess errors) >> >> Old tests that passed, that have disappeared: (Eeek!) >> >> gfortran.dg/pr32627.f03 -O1 (test for excess errors) >> gfortran.dg/pr32627.f03 -O3 -fomit-frame-pointer -funroll-all-loops >> -finline-functions (test for excess errors) >> gfortran.dg/pr32627.f03 -O3 -g (test for excess errors) >> gfortran.dg/substring_equivalence.f90 -O (test for excess errors) >> Using /home/zadeck/gcc/gccBaseline/gcc/testsuite/config/default.exp as >> tool-and-target-specific interface file. >> >> === g++ Summary === >> >> # of expected passes49793 >> # of expected failures284 >> # of unsupported tests601 >> >> runtest completed at Fri Oct 5 16:10:20 2012 >> heracles:~/gcc(16) tail gbWide/gcc/testsuite/g++/g++.log Using >> /usr/share/dejagnu/config/unix.exp as generic interface file for target. >> Using /home/zadeck/gcc/gccWide/gcc/testsuite/config/default.exp as >> tool-and-target-specific interface file. >> >> === g++ Summary === >> >> # of expected passes50472 >> # of expected failures284 >> # of unsupported tests613 >> >> runtest completed at Fri Oct 5 19:51:50 2012 >> >> >> >> >> >
Re: [RFC] Implement load sinking in loops
On Mon, Oct 8, 2012 at 12:38 PM, Eric Botcazou wrote: > Hi, > > we recently noticed that, even at -O3, the compiler doesn't figure out that > the following loop is dumb: > > #define SIZE 64 > > int foo (int v[]) > { > int r; > > for (i = 0; i < SIZE; i++) > r = v[i]; > > return r; > } > > which was a bit of a surprise. On second thoughts, this isn't entirely > unexpected, as it probably matters only for (slightly) pathological cases. > The attached patch nevertheless implements a form of load sinking in loops so > as to optimize these cases. It's combined with invariant motion to optimize: > > int foo (int v[], int a) > { > int r, i; > > for (i = 0; i < SIZE; i++) > r = v[i] + a; > > return r; > } > > and with store sinking to optimize: > > int foo (int v1[], int v2[]) > { > int r[SIZE]; > int i, j; > > for (j = 0; j < SIZE; j++) > for (i = 0; i < SIZE; i++) > r[j] = v1[j] + v2[i]; > > return r[SIZE - 1]; > } > > The optimization is enabled at -O2 in the patch for measurement purposes but, > given how rarely it triggers (e.g. exactly 10 occurrences in a GCC bootstrap, > compiler-only, all languages except Go), it's probably best suited to -O3. > Or perhaps we don't care and it should simply be dropped... Thoughts? Incidentially we have scev-const-prop to deal with the similar case of scalar computations. But I realize this doesn't work for expressions that are dependent on a loop variant load. @@ -103,6 +103,7 @@ typedef struct mem_ref_loc { tree *ref; /* The reference itself. */ gimple stmt; /* The statement in that it occurs. */ + tree lhs;/* The (ultimate) LHS for a load. */ } *mem_ref_loc_p; isn't that the lhs of stmt? +static gimple_seq +copy_load_and_single_use_chain (mem_ref_loc_p loc, tree *new_lhs) +{ + tree mem = *loc->ref; + tree lhs, tmp_var, ssa_name; + gimple_seq seq = NULL; + gimple stmt; + unsigned n = 0; + + /* First copy the load and create the new LHS for it. */ + lhs = gimple_assign_lhs (loc->stmt); + tmp_var = create_tmp_reg (TREE_TYPE (lhs), get_lsm_tmp_name (mem, n++)); use make_temp_ssa_name or simply copy_ssa_name (not sure you need fancy names here). + if (gimple_assign_rhs1 (use_stmt) == lhs) + { + op1 = ssa_name; + op2 = gimple_assign_rhs2 (use_stmt); + } + else + { + op1 = gimple_assign_rhs1 (use_stmt); + op2 = ssa_name; + } this may enlarge lifetime of the other operand? And it looks like it would break with unary stmts (accessing out-of-bounds op2). Also for is_gimple_min_invariant other operand which may be for example &a.b you need to unshare_expr it. + lhs = gimple_assign_lhs (use_stmt); + tmp_var = create_tmp_reg (TREE_TYPE (lhs), get_lsm_tmp_name (mem, n++)); + stmt = gimple_build_assign_with_ops (rhs_code, tmp_var, op1, op2); + ssa_name = make_ssa_name (tmp_var, stmt); + gimple_assign_set_lhs (stmt, ssa_name); see above. This can now be simplified to lhs = gimple_assign_lhs (use_stmt); ssa_name = copy_ssa_name (lhs, NULL); stmt = gimple_build_assign_with_ops (rhs_code, ssa_name, op1, op2); Btw - isn't this all a bit backward (I mean the analysis in execute_lm?) What you want is apply this transform to as much of the _DEF_s of the loop-closed PHI nodes - only values used outside of the loop are interesting. Thats (sort-of) what SCEV const-prop does (well, it also uses SCEV to compute the overall effect of the iterations). So what you want to know is whether when walking the DEF chain of the loop closed PHI you end up at definitions before the loop or at definitions that are not otherwise used inside the loop. Which means it is really expression sinking. 
Does tree-ssa-sink manage to sink anything out of a loop? Even scalar computation parts I mean? For for (..) { a = x[i]; y[i] = a; b = a * 2; } ... = b; it should be able to sink b = a*2. So I think the more natural place to implement this is either SCEV cprop or tree-ssa-sink.c. And process things from the loop-closed PHI use walking the DEFs (first process all, marking interesting things to also catch commonly used exprs for two PHI uses). Again you might simply want to open a bugreport for this unless you want to implement it yourself. Thanks, Richard. > Tested on x86_64-suse-linux. > > > 2012-10-08 Eric Botcazou > > * gimple.h (gsi_insert_seq_on_edge_before): Declare. > * gimple-iterator.c (gsi_insert_seq_on_edge_before): New function. > * tree-ssa-loop-im.c (struct mem_ref_loc): Add LHS field. > (mem_ref_in_stmt): Remove gcc_assert. > (copy_load_and_single_use_chain): New function. > (execute_lm): Likewise. > (hoist_memory_references): Hoist the loads after the stores. > (ref_always_accessed_p): Rename into... > (ref_always_stored_p): ...this. Remove STORE_P and add ONCE_P. >
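To make the intended transformation concrete, here is a hand-written before/after sketch of the second example from the RFC (not compiler output; it assumes the loop is known to execute, and later passes would delete the now-empty loop):

/* Before: the load and the add run on every iteration although only
   the final value of r is live after the loop.  */
int foo (int v[], int a)
{
  int r, i;
  for (i = 0; i < 64; i++)
    r = v[i] + a;
  return r;
}

/* After sinking the load and its single-use chain out of the loop
   (hand-written sketch): only the last element is read.  */
int foo_sunk (int v[], int a)
{
  int i;
  for (i = 0; i < 64; i++)
    ;
  return v[63] + a;
}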
Re: [RFC] Implement load sinking in loops
On Mon, Oct 8, 2012 at 2:32 PM, Richard Guenther wrote: > On Mon, Oct 8, 2012 at 12:38 PM, Eric Botcazou wrote: >> Hi, >> >> we recently noticed that, even at -O3, the compiler doesn't figure out that >> the following loop is dumb: >> >> #define SIZE 64 >> >> int foo (int v[]) >> { >> int r; >> >> for (i = 0; i < SIZE; i++) >> r = v[i]; >> >> return r; >> } >> >> which was a bit of a surprise. On second thoughts, this isn't entirely >> unexpected, as it probably matters only for (slightly) pathological cases. >> The attached patch nevertheless implements a form of load sinking in loops so >> as to optimize these cases. It's combined with invariant motion to optimize: >> >> int foo (int v[], int a) >> { >> int r, i; >> >> for (i = 0; i < SIZE; i++) >> r = v[i] + a; >> >> return r; >> } >> >> and with store sinking to optimize: >> >> int foo (int v1[], int v2[]) >> { >> int r[SIZE]; >> int i, j; >> >> for (j = 0; j < SIZE; j++) >> for (i = 0; i < SIZE; i++) >> r[j] = v1[j] + v2[i]; >> >> return r[SIZE - 1]; >> } >> >> The optimization is enabled at -O2 in the patch for measurement purposes but, >> given how rarely it triggers (e.g. exactly 10 occurrences in a GCC bootstrap, >> compiler-only, all languages except Go), it's probably best suited to -O3. >> Or perhaps we don't care and it should simply be dropped... Thoughts? > > Incidentially we have scev-const-prop to deal with the similar case of > scalar computations. But I realize this doesn't work for expressions that > are dependent on a loop variant load. > > @@ -103,6 +103,7 @@ typedef struct mem_ref_loc > { >tree *ref; /* The reference itself. */ >gimple stmt; /* The statement in that it occurs. */ > + tree lhs;/* The (ultimate) LHS for a load. */ > } *mem_ref_loc_p; > > isn't that the lhs of stmt? > > +static gimple_seq > +copy_load_and_single_use_chain (mem_ref_loc_p loc, tree *new_lhs) > +{ > + tree mem = *loc->ref; > + tree lhs, tmp_var, ssa_name; > + gimple_seq seq = NULL; > + gimple stmt; > + unsigned n = 0; > + > + /* First copy the load and create the new LHS for it. */ > + lhs = gimple_assign_lhs (loc->stmt); > + tmp_var = create_tmp_reg (TREE_TYPE (lhs), get_lsm_tmp_name (mem, n++)); > > use make_temp_ssa_name or simply copy_ssa_name (not sure you need > fancy names here). > > + if (gimple_assign_rhs1 (use_stmt) == lhs) > + { > + op1 = ssa_name; > + op2 = gimple_assign_rhs2 (use_stmt); > + } > + else > + { > + op1 = gimple_assign_rhs1 (use_stmt); > + op2 = ssa_name; > + } > > this may enlarge lifetime of the other operand? And it looks like it would > break with unary stmts (accessing out-of-bounds op2). Also for > is_gimple_min_invariant other operand which may be for example &a.b > you need to unshare_expr it. > > + lhs = gimple_assign_lhs (use_stmt); > + tmp_var = create_tmp_reg (TREE_TYPE (lhs), get_lsm_tmp_name (mem, > n++)); > + stmt = gimple_build_assign_with_ops (rhs_code, tmp_var, op1, op2); > + ssa_name = make_ssa_name (tmp_var, stmt); > + gimple_assign_set_lhs (stmt, ssa_name); > > see above. This can now be simplified to > >lhs = gimple_assign_lhs (use_stmt); >ssa_name = copy_ssa_name (lhs, NULL); >stmt = gimple_build_assign_with_ops (rhs_code, ssa_name, op1, op2); > > Btw - isn't this all a bit backward (I mean the analysis in execute_lm?) > What you want is apply this transform to as much of the _DEF_s of > the loop-closed PHI nodes - only values used outside of the loop are > interesting. Thats (sort-of) what SCEV const-prop does (well, it also > uses SCEV to compute the overall effect of the iterations). 
So what > you want to know is whether when walking the DEF chain of the > loop closed PHI you end up at definitions before the loop or at > definitions that are not otherwise used inside the loop. > > Which means it is really expression sinking. Does tree-ssa-sink manage > to sink anything out of a loop? Even scalar computation parts I mean? For > > for (..) >{ > a = x[i]; > y[i] = a; > b = a * 2; >} > ... = b; > > it should be able to sink b = a*2. > > So I think the more natural place to implement this is either SCEV cprop > or tree-ssa-sink.c. And process things fro
Re: gcc/lto/lto.c: Free lto_file struct after closing the file
On Mon, Oct 8, 2012 at 2:39 PM, Tobias Burnus wrote: > lto_obj_file_open allocates: > lo = XCNEW (struct lto_simple_object); > However, the data is never freed - neither explicitly nor in > lto_obj_file_close. > > In the attached patch, I free the memory now after the call to > lto_obj_file_close. > > Build and regtested on x86-64-gnu-linux. > OK for the trunk? Ok. Thanks, Richard. > Tobias
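The patch itself is attached to the original mail; as described, the fix amounts to releasing the wrapper after the close, along the lines of this sketch (the call site and variable name are illustrative, not quoted from the patch):

/* Sketch only -- see the attached patch for the actual change.  */
lto_obj_file_close (file);
free (file);
file = NULL;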
Re: patch to fix constant math
On Mon, Oct 8, 2012 at 5:01 PM, Nathan Froyd wrote: > - Original Message - >> Btw, as for Richards idea of conditionally placing the length field >> in >> rtx_def looks like overkill to me. These days we'd merely want to >> optimize for 64bit hosts, thus unconditionally adding a 32 bit >> field to rtx_def looks ok to me (you can wrap that inside a union to >> allow both descriptive names and eventual different use - see what >> I've done to tree_base) > > IMHO, unconditionally adding that field isn't "optimize for 64-bit > hosts", but "gratuitously make one of the major compiler data > structures bigger on 32-bit hosts". Not everybody can cross-compile > from a 64-bit host. And even those people who can don't necessarily > want to. Please try to consider what's best for all the people who > use GCC, not just the cases you happen to be working with every day. The challenge would of course be to have the overhead only for a minority of all RTX codes. After all that 32bits are free to be used for every one. And I would not consider RTX a 'major compiler data structure' - of course that makes the whole issue somewhat moot ;) Richard. > -Nathan
Re: patch to fix constant math
On Mon, Oct 8, 2012 at 9:55 PM, Richard Sandiford wrote: > Robert Dewar writes: >> On 10/8/2012 11:01 AM, Nathan Froyd wrote: >>> - Original Message - Btw, as for Richards idea of conditionally placing the length field in rtx_def looks like overkill to me. These days we'd merely want to optimize for 64bit hosts, thus unconditionally adding a 32 bit field to rtx_def looks ok to me (you can wrap that inside a union to allow both descriptive names and eventual different use - see what I've done to tree_base) >>> >>> IMHO, unconditionally adding that field isn't "optimize for 64-bit >>> hosts", but "gratuitously make one of the major compiler data >>> structures bigger on 32-bit hosts". Not everybody can cross-compile >>> from a 64-bit host. And even those people who can don't necessarily >>> want to. Please try to consider what's best for all the people who >>> use GCC, not just the cases you happen to be working with every day. >> >> I think that's rasonable in general, but as time goes on, and every >> $300 laptop is 64-bit capable, one should not go TOO far out of the >> way trying to make sure we can compile everything on a 32-bit machine. > > It's not 64-bit machine vs. 32-bit machine. It's an LP64 ABI vs. > an ILP32 ABI. HJ & co. have put considerable effort into developing > the x32 ABI for x86_64 precisely because ILP32 is still useful for > 64-bit machines. Just as it was for MIPS when SGI invented n32 > (which is still useful now). I believe 64-bit SPARC has a similar > thing, and no doubt other architectures do too. > > After all, there shouldn't be much need for more than 2GB of virtual > address space in an AVR cross compiler. So why pay the cache penalty > of 64-bit pointers and longs (GCC generally tries to avoid using "long" > directly) when a 32-bit pointer will do? > > Many years ago, I moved the HOST_WIDE_INT fields out of rtunion > and into the main rtx_def union because it produced a significant > speed-up on n32 IRIX. That was before tree-level optimisation, But of course that doesn't change the alignment requirement of that union which is the issue on lp64 hosts. Also as HOST_WIDE_INT is 64bits for most ilp32 targets as well it doesn't necessarily save ilp32 hosts from having this issue (unless long long isn't aligned to 8 bytes for them). So what I said is that arranging for the 32bit hole to contain useful data for most RTXen should be possible. Richard. > but I don't think we've really pruned that much RTL optimisation > since then, so I'd be surprised if much has changed.
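A layout sketch may help make the alignment argument concrete (purely hypothetical, not the real rtl.h definition):

/* Hypothetical sketch, not the actual rtx_def.  On LP64 hosts the
   union u is 8-byte aligned because of its HOST_WIDE_INT-like member,
   so without u2 the compiler pads the 4 bytes after the flag word; a
   wrapped 32-bit field simply reuses that hole and the structure does
   not grow.  On ILP32 hosts where long long is only 4-byte aligned
   there is no such hole, and the extra field does enlarge every rtx.  */
struct rtx_def_sketch
{
  unsigned short code;
  unsigned char mode;
  unsigned char flags;          /* stands in for the real flag bits */

  union
  {
    int original_regno;         /* descriptive names for possible uses */
    unsigned int length;
  } u2;                         /* the 32-bit field under discussion */

  union
  {
    long long hwint;            /* HOST_WIDE_INT-like member */
    void *ptr;
    struct rtx_def_sketch *next;
  } u;
};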
Re: Small cleanup/memory leak plugs for lto
On Mon, Oct 8, 2012 at 10:05 PM, Tobias Burnus wrote: > Some more issues found by the Coverity scanner. > > lto-cgraph.c: The code seems to be unused; besides, it's a zero-trip loop as > parm_num is set to 0 and then checked for nonzeroness. > > lto-opts: The check whether first_p is non NULL is always false: All calls > have a variable ref as argument - and first_p is unconditionally > dereferenced. > > lto_obj_file_open: One could additionally check whether "lo" is NULL, but that > would then have to be directly after the XCNEW, as lto_file_init already > dereferences "lo". static void append_to_collect_gcc_options (struct obstack *ob, bool *first_p, const char *opt) { const char *p, *q = opt; - if (!first_p) -obstack_grow (ob, " ", 1); bogus change. It should be if (!*first_p) obstack_grow (ob, " ", 1); The rest looks ok. Please re-test with the above change. Thanks, Richard. > Build and regtested on x86-64-gnu-linux > > Tobias
Re: [lra] 3rd patch to speed more compilation of PR54146
On Tue, Oct 9, 2012 at 6:10 PM, Vladimir Makarov wrote: > On 10/08/2012 05:14 PM, Steven Bosscher wrote: >> >> On Mon, Oct 8, 2012 at 10:26 PM, Vladimir Makarov >> wrote: >> >>> So I checked it on big file with > hundred functionson Intel machine and >>> got >>> >>> before a part of the patch implementing the insn stack as sbitmap >>> real=243.40 user=241.61 system=1.00 >>> >>> after the part of the patch: >>> real=244.89 user=243.02 system=0.96 >> >> Is that more than just noise, you think? A ~1.5s difference on ~240 >> total isn't very much. I measured the timings on my set of cc1-i >> files, and sometimes the without-patch compiler was faster by a tiny >> amount, and sometimes it was slower. Even on an average of 10 runs I >> really couldn't say that the patch was a win or loss on the whole. >> >> I measured this on a mostly idle machine at home, not gcc17, which >> seems to be even more busy than usual lately, thanks to you and me :-) >> >> > Sorry, Steven. It was a noise. I ran it again now 3 times and found that > was a noise. > > After some thinking, I realized that sbitmap is really the best > representation for this particular case. That is because at the 1st > constraint pass we have all bits are set as we process *all* insns on the > 1st pass. So sbitmap (at least before the extension and if we have pretty > compact UIDs) will be always smaller than bitmap. > > I committed the following your patch to the branch (as rev. 192264). > > And again, sorry for the false conclusions. Btw, congratulations for all the speedups (even though they probably are noise for "regular" programs?)! I'm looking forward to the merge of LRA for x86 now. Richard. > 2012-10-09 Steven Bosscher > > > * lra-int.h (lra_constraint_insn_stack_bitmap, > lra_constraint_insn_stack): Remove. > (lra_pop_insn, lra_insn_stack_length): New prototypes. > * lra.c (lra_constraint_insn_stack_bitmap): Make static sbitmap. > (lra_constraint_insn_stack): Make static. > (lra_push_insn_1): New function. > (lra_push_insn): Rewrite using lra_push_insn_1. > (lra_push_insn_and_update_insn_regno_info): Likewise. > (lra_pop_insn, lra_insn_stack_length): New functions. > * lra_constraints.c (lra_constraints): Use new interface to > insns stack instead of manipulating in-place. >
Re: [PATCH] Rename __gnu_slim_lto to __gnu_lto_slim
On Sun, Oct 16, 2011 at 1:12 AM, Andi Kleen wrote: > From: Andi Kleen > > Some external tools special case __gnu_lto* symbols, and the new > __gnu_slim_lto > was the only LTO symbol not matching this pattern. Since I don't think > there are any users rename it to __gnu_lto_slim > > Passes bootstrap and test suite on x86_64-linux. Ok? Ok. Thanks, Richard. > Cc: hubi...@ucw.cz > > gcc/: > > 2011-10-13 Andi Kleen > > * toplev.c (compile_file): Rename __gnu_slim_lto to __gnu_lto_slim. > --- > gcc/toplev.c | 6 +++--- > 1 files changed, 3 insertions(+), 3 deletions(-) > > diff --git a/gcc/toplev.c b/gcc/toplev.c > index ab6b5a4..63e229e 100644 > --- a/gcc/toplev.c > +++ b/gcc/toplev.c > @@ -651,13 +651,13 @@ compile_file (void) > { > #if defined ASM_OUTPUT_ALIGNED_DECL_COMMON > ASM_OUTPUT_ALIGNED_DECL_COMMON (asm_out_file, NULL_TREE, > - "__gnu_slim_lto", > + "__gnu_lto_slim", > (unsigned HOST_WIDE_INT) 1, 8); > #elif defined ASM_OUTPUT_ALIGNED_COMMON > - ASM_OUTPUT_ALIGNED_COMMON (asm_out_file, "__gnu_slim_lto", > + ASM_OUTPUT_ALIGNED_COMMON (asm_out_file, "__gnu_lto_slim", > (unsigned HOST_WIDE_INT) 1, 8); > #else > - ASM_OUTPUT_COMMON (asm_out_file, "__gnu_slim_lto", > + ASM_OUTPUT_COMMON (asm_out_file, "__gnu_lto_slim", > (unsigned HOST_WIDE_INT) 1, > (unsigned HOST_WIDE_INT) 1); > #endif > -- > 1.7.5.4 > >
Re: [PATCH 1/5] Use MADV_DONTNEED for freeing in garbage collector
On Sun, Oct 16, 2011 at 7:30 AM, Andi Kleen wrote: > Andi Kleen writes: > >> From: Andi Kleen >> >> Use the Linux MADV_DONTNEED call to unmap free pages in the garbage >> collector.Then keep the unmapped pages in the free list. This avoid >> excessive memory fragmentation on large LTO bulds, which can lead >> to gcc bumping into the Linux vm_max_map limit per process. > > Could I have a decision on this patch please? The problem in PR50636 > is still there and this is the minimum fix to fix it on Linux > as far as I know. > > If this patch is not the right way to go I would > appreciate some guidance on an alternative (but low cost) > implementation. Note I don't have capacity for any overly > complicated solutions. The patch looks generally ok, but you are never giving back pages to the system, and as we have other memory allocations that do not use the ggc pools you drain virtual memory on 32bit hosts. Is any other patch in this series compensating for it? If not I'd say we should munmap the pages when a full mapped range (2MB) is free. Can you rename 'unmapped' to 'discarded' please? That would be less confusing. Thanks, Richard. > Thanks, > > -Andi > > -- > a...@linux.intel.com -- Speaking for myself only >
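For reference, the mechanism under review boils down to the following sketch (not the ggc-page.c code; the page-table bookkeeping and error handling are omitted): madvise with MADV_DONTNEED lets the kernel reclaim the physical pages while the virtual range stays mapped, so a discarded page can stay on the free list without adding entries to the kernel's per-process map table.

#include <stddef.h>
#include <sys/mman.h>

/* Sketch only: give the physical backing of a free GC page back to the
   kernel but keep its virtual mapping.  The kernel repopulates the
   range with zero-filled pages the next time it is touched, so reusing
   the page later needs no new mmap call.  */
static void
discard_page (void *page, size_t size)
{
  madvise (page, size, MADV_DONTNEED);
}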
Re: [C++ Patch] PR 32614
On Sun, Oct 16, 2011 at 12:31 PM, Paolo Carlini wrote: > On 10/16/2011 12:28 PM, Gabriel Dos Reis wrote: >> >> On Sun, Oct 16, 2011 at 5:03 AM, Paolo Carlini >> wrote: >>> >>> Hi, >>> >>> in this simple documentation PR, Tom noticed that we have a (very long >>> standing) inconsistency between the default value of -fmessage-length for >>> C++ as documented and as implemented: in fact it's 0 in >>> cxx-pretty-print.c, >>> like all the other front-ends. At the time of the PR people briefly >>> envisage >>> changing the default but the discussion didn't go anywhere, I think we >>> can >>> as well do the below, for the time being at least, remove the >>> inconsistency. >>> >>> Ok? >> >> I still think the default for g++ should be 72. Notice that other >> front-ends have set it to zero because they feared something, unlike the >> C++ frontend. > > To be clear, I have no strong opinion. But last time Richard Gunther > strongly disagreed (and now he is a Global Reviewer ;) Thus, just let me > know guys... 0 is just so much more convenient for consumers and all consumers that care for line lengths can properly wrap around. So I don't see a good reason to have -fmessage-length at all. Richard. > Paolo. >
Re: [C++ Patch] PR 32614
On Sun, Oct 16, 2011 at 1:03 PM, Gabriel Dos Reis wrote: > On Sun, Oct 16, 2011 at 5:42 AM, Richard Guenther > wrote: >> On Sun, Oct 16, 2011 at 12:31 PM, Paolo Carlini >> wrote: >>> On 10/16/2011 12:28 PM, Gabriel Dos Reis wrote: >>>> >>>> On Sun, Oct 16, 2011 at 5:03 AM, Paolo Carlini >>>> wrote: >>>>> >>>>> Hi, >>>>> >>>>> in this simple documentation PR, Tom noticed that we have a (very long >>>>> standing) inconsistency between the default value of -fmessage-length for >>>>> C++ as documented and as implemented: in fact it's 0 in >>>>> cxx-pretty-print.c, >>>>> like all the other front-ends. At the time of the PR people briefly >>>>> envisage >>>>> changing the default but the discussion didn't go anywhere, I think we >>>>> can >>>>> as well do the below, for the time being at least, remove the >>>>> inconsistency. >>>>> >>>>> Ok? >>>> >>>> I still think the default for g++ should be 72. Notice that other >>>> front-ends have set it to zero because they feared something, unlike the >>>> C++ frontend. >>> >>> To be clear, I have no strong opinion. But last time Richard Gunther >>> strongly disagreed (and now he is a Global Reviewer ;) Thus, just let me >>> know guys... >> >> 0 is just so much more convenient for consumers and all consumers that >> care for line lengths can properly wrap around. So I don't see a good >> reason to have -fmessage-length at all. > > This is an argument that should have been made more than a decade ago > and -fmessage-length was *requested* by users who did care about > line length, and implemented for g++. I wasn't around at that time, so sorry for not raising my opinion earlier ;) Richard.
Re: [PATCH] Clear DECL_GIMPLE_REG_P when making parameter copy addressable (PR tree-optimization/50735)
On Sun, Oct 16, 2011 at 5:47 PM, Jakub Jelinek wrote: > Hi! > > gimplify_parameters uses create_tmp_reg, but sometimes it decides to make > it addressable (if the PARM_DECL is addressable). If so, it must not be > DECL_GIMPLE_REG_P. > > Alternatively we could call create_tmp_reg only if !TREE_ADDRESSABLE and > call create_tmp_var instead for TREE_ADDRESSABLE (+ set TREE_ADDRESSABLE). > > Bootstrapped/regtested on x86_64-linux and i686-linux, ok for trunk? I think this should be exactly the other way around, using create_tmp_var, copying TREE_ADDRESSABLE and setting DECL_GIMPLE_REG_P if it is not addressable. Ok with that change. Thanks, Richard. > 2011-10-16 Jakub Jelinek > > PR tree-optimization/50735 > * function.c (gimplify_parameters): When making local TREE_ADDRESSABLE, > also clear DECL_GIMPLE_REG_P. > > --- gcc/function.c.jj 2011-10-14 08:21:56.0 +0200 > +++ gcc/function.c 2011-10-15 12:43:23.0 +0200 > @@ -3624,7 +3624,10 @@ gimplify_parameters (void) > not the PARMs. Keep the parms address taken > as we'll query that flag during gimplification. */ > if (TREE_ADDRESSABLE (parm)) > - TREE_ADDRESSABLE (local) = 1; > + { > + TREE_ADDRESSABLE (local) = 1; > + DECL_GIMPLE_REG_P (local) = 0; > + } > } > else > { > > Jakub >
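Spelled out, the alternative suggested above would look roughly like this at the spot the patch touches (a sketch of the suggestion, not the committed fix; variable names follow the quoted hunk):

/* Sketch of the suggested variant: build the copy with create_tmp_var,
   copy the addressability over from the PARM_DECL, and only mark the
   copy as a GIMPLE register when it is not addressable.  */
local = create_tmp_var (TREE_TYPE (parm), get_name (parm));
TREE_ADDRESSABLE (local) = TREE_ADDRESSABLE (parm);
if (!TREE_ADDRESSABLE (local)
    && (TREE_CODE (TREE_TYPE (local)) == COMPLEX_TYPE
        || TREE_CODE (TREE_TYPE (local)) == VECTOR_TYPE))
  DECL_GIMPLE_REG_P (local) = 1;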
Re: [PATCH 1/5] Use MADV_DONTNEED for freeing in garbage collector
On Sun, Oct 16, 2011 at 8:33 PM, Andi Kleen wrote: > On Sun, Oct 16, 2011 at 12:38:16PM +0200, Richard Guenther wrote: >> On Sun, Oct 16, 2011 at 7:30 AM, Andi Kleen wrote: >> > Andi Kleen writes: >> > >> >> From: Andi Kleen >> >> >> >> Use the Linux MADV_DONTNEED call to unmap free pages in the garbage >> >> collector.Then keep the unmapped pages in the free list. This avoid >> >> excessive memory fragmentation on large LTO bulds, which can lead >> >> to gcc bumping into the Linux vm_max_map limit per process. >> > >> > Could I have a decision on this patch please? The problem in PR50636 >> > is still there and this is the minimum fix to fix it on Linux >> > as far as I know. >> > >> > If this patch is not the right way to go I would >> > appreciate some guidance on an alternative (but low cost) >> > implementation. Note I don't have capacity for any overly >> > complicated solutions. >> >> The patch looks generally ok, but you are never giving back pages to the > > It gives back pages, just not virtual address space. But I guess that is > what you meant. > > On the other hand this patch can actually give you more virtual > address space when you need large regions (>2 pages or so). > The reason is that the old allocation pattern fragments the whole > address space badly and only leaves these small holes. With the madvise > patch that does not happen, ggc is all in a compacted chunk. Sure, but we do compete with the glibc heap with virtual memory usage (I wonder if GGC should simply use malloc/free ...). So I am worried that we run out of address space earlier this way. But I guess we can revisit this when we run into actual problems ... >> system, and as we have other memory allocations that do not use the >> ggc pools you drain virtual memory on 32bit hosts. Is any other patch >> in this series compensating for it? If not I'd say we should munmap the >> pages when a full mapped range (2MB) is free. Can you rename > > I wrote such a patch initially, but ran into various problems, so > I dropped it from the series. I can revisit it. Yes, please revisit it. It should be as simple as scanning for a large chunk in free_pages I suppose. >> 'unmapped' to 'discarded' please? That would be less confusing. > > Ok I can do that. > > Was that an approval? Ok with the rename. Thanks, Richard. > -Andi >
Re: [patch] Fix PR tree-optimization/49960 ,Fix self data dependence
On Mon, Oct 17, 2011 at 8:23 AM, Razya Ladelsky wrote: > This patch fixes the failures described in > http://gcc.gnu.org/bugzilla/show_bug.cgi?id=49960 > It also fixes bzips when run with autopar enabled. > > In both cases the self dependences are not handled correctly. > In the first case, a non affine access is analyzed: > in the second, the distance vector is not calculated correctly (the > distance vector considered for for self dependences is always (0,0,...)) > > As a result, the loops get wrongfully parallelized. > > The patch avoids the special handling of self dependences, and analyzes > all dependences in the same way. Specific adjustments > and support for the self dependence cases were made. Can you elaborate on @@ -3119,8 +3135,11 @@ add_other_self_distances (struct data_dependence_r { if (DDR_NUM_SUBSCRIPTS (ddr) != 1) { - DDR_ARE_DEPENDENT (ddr) = chrec_dont_know; - return; + if (DDR_NUM_SUBSCRIPTS (ddr) != 2 || !integer_zerop (DR_ACCESS_FN (DDR_A (ddr), 1))) + { + DDR_ARE_DEPENDENT (ddr) = chrec_dont_know; + return; + } } access_fun = DR_ACCESS_FN (DDR_A (ddr), 0); ? It needed a comment before, and now so even more. The rest of the patch is ok, I suppose the above hunk is to enhance something, not to fix the bug? Thanks, Richard. > Bootstrap and testsuite pass successfully for ppc64-redhat-linux. > > OK for trunk? > Thank you, > Razya > > > ChangeLog: > > PR tree-optimization/49960 > * tree-data-ref.c (compute_self_dependence): Remove. > (initialize_data_dependence_relation): Add intializations. > Remove compute_self_dependence. > (add_other_self_distances): Add support for two dimensions if > the second is zero. > (compute_affine_dependence): Remove the !DDR_SELF_REFERENCE > condition. > (compute_all_dependences): Remove call to > compute_self_dependence. Add call to compute_affine_dependence > > testsuite/ChangeLog: > > PR tree-optimization/49660 > * gcc.dg/autopar/pr49660.c: New test. > * gcc.dg/autopar/pr49660-1.c: New test. > > > > > > > > > > > > > > > >
Re: [C++ Patch] PR 32614
On Mon, Oct 17, 2011 at 11:42 AM, Paolo Carlini wrote: > FWIW, I still believe that tweaking the documentation to match the long > standing reality would be a small improvement. I'm not going to further > insist, anyway. The original patch is ok. Thanks, Richard. > Paolo. >
Re: [C++ Patch] PR 32614
On Mon, Oct 17, 2011 at 12:26 PM, Gabriel Dos Reis wrote: > On Mon, Oct 17, 2011 at 4:42 AM, Paolo Carlini > wrote: >> FWIW, I still believe that tweaking the documentation to match the long >> standing reality, would be a small improvement. I'm not going to further >> insist, anyway. > > It isn't improvement. > The improvement would be to restore the documented default. I disagree. Well, both would be an improvement. Restoring the documented default would be a move in the wrong direction though. Why have different defaults for different languages anyway? How long has the "new" default be in effect? Richard.
Re: [patch tree-optimization]: Improve handling of conditional-branches on targets with high branch costs
On Fri, Oct 14, 2011 at 9:43 PM, Kai Tietz wrote: > Hello, > > So I committed the gimplify patch separate. And here is the remaining > fold-const patch. > The important tests here are in gcc.dg/tree-ssa/builtin-expect[1-4].c, which > cover the one special-case for branching. Also tree-ssa/20040204-1.c covers > tests for branching code (on targets having high-engough BRANCH_COST and no > special-casing - like MIPS, S/390, and AVR. > > ChangeLog > > 2011-10-14 Kai Tietz > > * fold-const.c (simple_operand_p_2): New function. > (fold_truthop): Rename to > (fold_truth_andor_1): function name. > Additionally remove branching creation for logical and/or. > (fold_truth_andor): Handle branching creation for logical and/or here. > > Bootstrapped and regression-tested for all languages plus Ada and > Obj-C++ on x86_64-pc-linux-gnu. > Ok for apply? Ok with ... > Regards, > Kai > > Index: gcc/gcc/fold-const.c > === > --- gcc.orig/gcc/fold-const.c > +++ gcc/gcc/fold-const.c > @@ -112,13 +112,13 @@ static tree decode_field_reference (loca > static int all_ones_mask_p (const_tree, int); > static tree sign_bit_p (tree, const_tree); > static int simple_operand_p (const_tree); > +static bool simple_operand_p_2 (tree); > static tree range_binop (enum tree_code, tree, tree, int, tree, int); > static tree range_predecessor (tree); > static tree range_successor (tree); > static tree fold_range_test (location_t, enum tree_code, tree, tree, tree); > static tree fold_cond_expr_with_comparison (location_t, tree, tree, > tree, tree); > static tree unextend (tree, int, int, tree); > -static tree fold_truthop (location_t, enum tree_code, tree, tree, tree); > static tree optimize_minmax_comparison (location_t, enum tree_code, > tree, tree, tree); > static tree extract_muldiv (tree, tree, enum tree_code, tree, bool *); > @@ -3500,7 +3500,7 @@ optimize_bit_field_compare (location_t l > return lhs; > } > > -/* Subroutine for fold_truthop: decode a field reference. > +/* Subroutine for fold_truth_andor_1: decode a field reference. > > If EXP is a comparison reference, we return the innermost reference. > > @@ -3668,7 +3668,7 @@ sign_bit_p (tree exp, const_tree val) > return NULL_TREE; > } > > -/* Subroutine for fold_truthop: determine if an operand is simple enough > +/* Subroutine for fold_truth_andor_1: determine if an operand is simple > enough > to be evaluated unconditionally. */ > > static int > @@ -3678,7 +3678,7 @@ simple_operand_p (const_tree exp) > STRIP_NOPS (exp); > > return (CONSTANT_CLASS_P (exp) > - || TREE_CODE (exp) == SSA_NAME > + || TREE_CODE (exp) == SSA_NAME > || (DECL_P (exp) > && ! TREE_ADDRESSABLE (exp) > && ! TREE_THIS_VOLATILE (exp) > @@ -3692,6 +3692,46 @@ simple_operand_p (const_tree exp) > registers aren't expensive. */ > && (! TREE_STATIC (exp) || DECL_REGISTER (exp; > } > + > +/* Subroutine for fold_truth_andor: determine if an operand is simple enough > + to be evaluated unconditionally. > + I addition to simple_operand_p, we assume that comparisons and logic-not > + operations are simple, if their operands are simple, too. */ > + > +static bool > +simple_operand_p_2 (tree exp) > +{ > + enum tree_code code; > + > + /* Strip any conversions that don't change the machine mode. */ > + STRIP_NOPS (exp); > + > + code = TREE_CODE (exp); > + > + if (TREE_CODE_CLASS (code) == tcc_comparison) > + return (!tree_could_trap_p (exp) > + && simple_operand_p_2 (TREE_OPERAND (exp, 0)) > + && simple_operand_p_2 (TREE_OPERAND (exp, 1))); recurse with simple_operand_p. 
> + > + if (TREE_SIDE_EFFECTS (exp) > + || tree_could_trap_p (exp)) Move this check before the tcc_comparison check and remove the then redundant tree_could_trap_p check there. > + return false; > + > + switch (code) > + { > + case SSA_NAME: > + return true; Do not handle here, it's handled in simple_operand_p. > + case TRUTH_NOT_EXPR: > + return simple_operand_p_2 (TREE_OPERAND (exp, 0)); > + case BIT_NOT_EXPR: > + if (TREE_CODE (TREE_TYPE (exp)) != BOOLEAN_TYPE) > + return false; Remove the BIT_NOT_EXPR handling. Thus, simply change this switch to if (code == TRUTH_NOT_EXPR) return simple_operand_p_2 (TREE_OPERAND (exp, 0)); return simple_operand_p (exp); > + return simple_operand_p_2 (TREE_OPERAND (exp, 0)); > + default: > + return simple_operand_p (exp); > + } > +} > + > > /* The following functions are subroutines to fold_range_test and allow it to > try to change a logical combination of comparisons into a range test. > @@ -4888,7 +4928,7 @@ fold_range_test (location_t loc, enum tr > return 0; > } > > -/* Subroutine for fold_truthop: C is an INTEGER_CST interpreted as a P > +/* Subroutine
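Putting the review comments above together, the predicate would end up roughly as follows (a consolidated sketch of the requested changes, not the version that was eventually committed):

/* Sketch: side effects and traps are rejected up front, comparison
   operands go through simple_operand_p, TRUTH_NOT_EXPR recurses, and
   everything else falls back to simple_operand_p.  */
static bool
simple_operand_p_2 (tree exp)
{
  enum tree_code code;

  /* Strip any conversions that don't change the machine mode.  */
  STRIP_NOPS (exp);

  if (TREE_SIDE_EFFECTS (exp)
      || tree_could_trap_p (exp))
    return false;

  code = TREE_CODE (exp);

  if (TREE_CODE_CLASS (code) == tcc_comparison)
    return (simple_operand_p (TREE_OPERAND (exp, 0))
            && simple_operand_p (TREE_OPERAND (exp, 1)));

  if (code == TRUTH_NOT_EXPR)
    return simple_operand_p_2 (TREE_OPERAND (exp, 0));

  return simple_operand_p (exp);
}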
Re: [PATCH 3/6] Emit macro expansion related diagnostics
On Mon, Oct 17, 2011 at 11:57 AM, Dodji Seketeli wrote: > In this third instalment the diagnostic machinery -- when faced with > the virtual location of a token resulting from macro expansion -- uses > the new linemap APIs to unwind the stack of macro expansions that led > to that token and emits a [hopefully] more useful message than what we > have today. > > diagnostic_report_current_module has been slightly changed to use the > location given by client code instead of the global input_location > variable. This results in more precise diagnostic locations in > general but then the patch adjusts some C++ tests which output changed > as a result of this. > > Three new regression tests have been added. > > The mandatory screenshot goes like this: > > [dodji@adjoa gcc]$ cat -n test.c > 1 #define OPERATE(OPRD1, OPRT, OPRD2) \ > 2 OPRD1 OPRT OPRD2; > 3 > 4 #define SHIFTL(A,B) \ > 5 OPERATE (A,<<,B) > 6 > 7 #define MULT(A) \ > 8 SHIFTL (A,1) > 9 > 10 void > 11 g () > 12 { > 13 MULT (1.0);/* 1.0 << 1; <-- so this is an error. */ > 14 } > > [dodji@adjoa gcc]$ ./cc1 -quiet -ftrack-macro-expansion test.c > test.c: In function 'g': > test.c:5:14: erreur: invalid operands to binary << (have 'double' and 'int') > test.c:2:9: note: in expansion of macro 'OPERATE' > test.c:5:3: note: expanded from here > test.c:5:14: note: in expansion of macro 'SHIFTL' > test.c:8:3: note: expanded from here > test.c:8:3: note: in expansion of macro 'MULT2' > test.c:13:3: note: expanded from here This broke bootstrap on x86_64-linux. /space/rguenther/src/svn/trunk/libcpp/line-map.c: In function 'source_location linemap_macro_map_loc_to_exp_point(const line_map*, source_location)': /space/rguenther/src/svn/trunk/libcpp/line-map.c:628:12: error: variable 'token_no' set but not used [-Werror=unused-but-set-variable] cc1plus: all warnings being treated as errors make[3]: *** [line-map.o] Error 1 > gcc/ChangeLog > 2011-10-15 Tom Tromey > Dodji Seketeli > > * gcc/diagnostic.h (diagnostic_report_current_module): Add a > location parameter. > * diagnostic.c (diagnostic_report_current_module): Add a location > parameter to the function definition. Use it instead of > input_location. Resolve the virtual location rather than just > looking up its map and risking to touch a resulting macro map. > (default_diagnostic_starter): Pass the relevant diagnostic > location to diagnostic_report_current_module. > * tree-diagnostic.c (maybe_unwind_expanded_macro_loc): New. > (virt_loc_aware_diagnostic_finalizer): Likewise. > (diagnostic_report_current_function): Pass the > relevant location to diagnostic_report_current_module. > * tree-diagnostic.h (virt_loc_aware_diagnostic_finalizer): Declare > new function. > * toplev.c (general_init): By default, use the new > virt_loc_aware_diagnostic_finalizer as diagnostic finalizer. > * Makefile.in: Add vec.h dependency to tree-diagnostic.c. > > gcc/cp/ChangeLog > 2011-10-15 Tom Tromey > Dodji Seketeli > > * error.c (cp_diagnostic_starter): Pass the relevant location to > diagnostic_report_current_module. > (cp_diagnostic_finalizer): Call virt_loc_aware_diagnostic_finalizer. > > gcc/testsuite/ChangeLog > 2011-10-15 Tom Tromey > Dodji Seketeli > > * lib/prune.exp (prune_gcc_output): Prune output referring to > included files. > * gcc.dg/cpp/macro-exp-tracking-1.c: New test. > * gcc.dg/cpp/macro-exp-tracking-2.c: Likewise. > * gcc.dg/cpp/macro-exp-tracking-3.c: Likewise. > * gcc.dg/cpp/pragma-diagnostic-2.c: Likewise. 
> > --- > gcc/ChangeLog | 21 +++ > gcc/Makefile.in | 3 +- > gcc/cp/ChangeLog | 7 + > gcc/cp/error.c | 5 +- > gcc/diagnostic.c | 13 +- > gcc/diagnostic.h | 2 +- > gcc/testsuite/ChangeLog | 10 ++ > gcc/testsuite/gcc.dg/cpp/macro-exp-tracking-1.c | 21 +++ > gcc/testsuite/gcc.dg/cpp/macro-exp-tracking-2.c | 21 +++ > gcc/testsuite/gcc.dg/cpp/macro-exp-tracking-3.c | 14 ++ > gcc/testsuite/gcc.dg/cpp/macro-exp-tracking-4.c | 14 ++ > gcc/testsuite/gcc.dg/cpp/pragma-diagnostic-2.c | 34 + > gcc/testsuite/lib/prune.exp | 1 + > gcc/toplev.c | 3 + > gcc/tree-diagnostic.c | 182 > ++- > gcc/tree-diagnostic.h | 3 +- > 16 files changed, 343 insertions(+), 11 deletions(-) > create mode 100644 gcc/testsuite/gcc.dg/cpp/macro-exp-tracking-1.c > create mode 100644 gcc/testsuite/gcc.dg/cpp/macro-exp-trac
Re: [PATCH] Fix pr50717: widening multiply bad code
On Sat, Oct 15, 2011 at 11:27 PM, Andrew Stubbs wrote: > This patch fixes a bug in which the widening multiply-and-accumulate > optimization failed to take the intermediate types into account. > > The effect of this is that the compiler would do what the programmer > expected to happen, rather than what the C standard requires to happen (in > many cases), so obviously this needed fixing. :) > > It still needs to optimize the cases where the optimization doesn't actually > change anything (because it's known that overflow cannot occur), so I don't > want to completely disallow extends between multiply and plus, and I believe > this patch achieves this. > > OK? Ok. Thanks, Richard. > Andrew >
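A concrete illustration (made up, not the PR 50717 testcase) of why the intermediate type matters:

/* Illustrative only.  With 32-bit unsigned int, the C standard requires
   a * b to be computed -- and wrapped -- in unsigned int before it is
   widened and added to c.  Blindly forming a 64-bit widening
   multiply-and-accumulate would keep the high bits of the product and
   change the result whenever the 32-bit multiplication overflows.  */
unsigned long long
acc (unsigned int a, unsigned int b, unsigned long long c)
{
  return c + a * b;
}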
Re: [PR50672, PATCH] Fix ice triggered by -ftree-tail-merge: verify_ssa failed: no immediate_use list
On Sun, Oct 16, 2011 at 12:05 PM, Tom de Vries wrote: > On 10/14/2011 12:00 PM, Richard Guenther wrote: >> On Fri, Oct 14, 2011 at 1:12 AM, Tom de Vries wrote: >>> On 10/12/2011 02:19 PM, Richard Guenther wrote: >>>> On Wed, Oct 12, 2011 at 8:35 AM, Tom de Vries >>>> wrote: >>>>> Richard, >>>>> >>>>> I have a patch for PR50672. >>>>> >>>>> When compiling the testcase from the PR with -ftree-tail-merge, the >>>>> scenario is >>>>> as follows: >>>>> >>>>> We start out tail_merge_optimize with blocks 14 and 20, which are alike, >>>>> but not >>>>> equal, since they have different successors: >>>>> ... >>>>> # BLOCK 14 freq:690 >>>>> # PRED: 25 [61.0%] (false,exec) >>>>> >>>>> if (wD.2197_57(D) != 0B) >>>>> goto ; >>>>> else >>>>> goto ; >>>>> # SUCC: 15 [78.4%] (true,exec) 16 [21.6%] (false,exec) >>>>> >>>>> >>>>> # BLOCK 20 freq:2900 >>>>> # PRED: 29 [100.0%] (fallthru) 31 [100.0%] (fallthru) >>>>> >>>>> # .MEMD.2447_209 = PHI <.MEMD.2447_125(29), .MEMD.2447_129(31)> >>>>> if (wD.2197_57(D) != 0B) >>>>> goto ; >>>>> else >>>>> goto ; >>>>> # SUCC: 5 [85.0%] (true,exec) 6 [15.0%] (false,exec) >>>>> ... >>>>> >>>>> In the first iteration, we merge block 5 with block 15 and block 6 with >>>>> block >>>>> 16. After that, the blocks 14 and 20 are equal. >>>>> >>>>> In the second iteration, the blocks 14 and 20 are merged, by redirecting >>>>> the >>>>> incoming edges of block 20 to block 14, and removing block 20. >>>>> >>>>> Block 20 also contains the definition of .MEMD.2447_209. Removing the >>>>> definition >>>>> delinks the vuse of .MEMD.2447_209 in block 5: >>>>> ... >>>>> # BLOCK 5 freq:6036 >>>>> # PRED: 20 [85.0%] (true,exec) >>>>> >>>>> # PT = nonlocal escaped >>>>> D.2306_58 = &thisD.2200_10(D)->D.2156; >>>>> # .MEMD.2447_132 = VDEF <.MEMD.2447_209> >>>>> # USE = anything >>>>> # CLB = anything >>>>> drawLineD.2135 (D.2306_58, wD.2197_57(D), gcD.2198_59(D)); >>>>> goto ; >>>>> # SUCC: 17 [100.0%] (fallthru,exec) >>>>> ... >>>> >>>> And block 5 is retained and block 15 is discarded? >>>> >>> >>> Indeed. >>> >>>>> After the pass, when executing the TODO_update_ssa_only_virtuals, we >>>>> update the >>>>> drawLine call in block 5 using rewrite_update_stmt, which calls >>>>> maybe_replace_use for the vuse operand. >>>>> >>>>> However, maybe_replace_use doesn't have an effect since the old vuse and >>>>> the new >>>>> vuse happen to be the same (rdef == use), so SET_USE is not called and >>>>> the vuse >>>>> remains delinked: >>>>> ... >>>>> if (rdef && rdef != use) >>>>> SET_USE (use_p, rdef); >>>>> ... >>>>> >>>>> The patch fixes this by forcing SET_USE for delinked uses. >>>> >>>> That isn't the correct fix. Whoever unlinks the vuse (by removing its >>>> definition) has to replace it with something valid, which is either the >>>> bare symbol .MEM, or the VUSE associated with the removed VDEF >>>> (thus, as unlink_stmt_vdef does). >>>> >>> >>> Another try. For each deleted bb, we call unlink_stmt_vdef for the >>> statements, >>> and replace the .MEM phi uses with the bare .MEM symbol. >>> >>> Bootstrapped and reg-tested on x86_64. >>> >>> Ok for trunk? >> >> Better. For >> >> + >> + FOR_EACH_IMM_USE_STMT (use_stmt, iter, res) >> + { >> + FOR_EACH_IMM_USE_ON_STMT (use_p, iter) >> + SET_USE (use_p, SSA_NAME_VAR (res)); >> + } >> >> you can use mark_virtual_phi_result_for_renaming (phi) instead. >> >> + for (i = gsi_last_bb (bb); !gsi_end_p (i); gsi_prev_nondebug (&am