Hi,

On 07/28/2011 08:20 PM, Tom de Vries wrote:
> On 07/28/2011 06:25 PM, Richard Guenther wrote:
>> On Thu, 28 Jul 2011, Tom de Vries wrote:
>>
>>> On 07/28/2011 12:22 PM, Richard Guenther wrote:
>>>> On Wed, 27 Jul 2011, Tom de Vries wrote:
>>>>
>>>>> On 07/27/2011 05:27 PM, Richard Guenther wrote:
>>>>>> On Wed, 27 Jul 2011, Tom de Vries wrote:
>>>>>>
>>>>>>> On 07/27/2011 02:12 PM, Richard Guenther wrote:
>>>>>>>> On Wed, 27 Jul 2011, Tom de Vries wrote:
>>>>>>>>
>>>>>>>>> On 07/27/2011 01:50 PM, Tom de Vries wrote:
>>>>>>>>>> Hi Richard,
>>>>>>>>>>
>>>>>>>>>> I have a patch set for bug 43513 - The stack pointer is adjusted
>>>>>>>>>> twice.
>>>>>>>>>>
>>>>>>>>>> 01_pr43513.3.patch
>>>>>>>>>> 02_pr43513.3.test.patch
>>>>>>>>>> 03_pr43513.3.mudflap.patch
>>>>>>>>>>
>>>>>>>>>> The patch set has been bootstrapped and reg-tested on x86_64.
>>>>>>>>>>
>>>>>>>>>> I will send out the patches individually.
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> The patch replaces a vla __builtin_alloca that has a constant
>>>>>>>>> argument with an array declaration.
>>>>>>>>>
>>>>>>>>> OK for trunk?
>>>>>>>>
>>>>>>>> I don't think it is safe to try to get at the VLA type the way you do.
>>>>>>>
>>>>>>> I don't understand in what way it's not safe.  Do you mean I don't
>>>>>>> manage to find the type always, or that I find the wrong type, or
>>>>>>> something else?
>>>>>>
>>>>>> I think you might get the wrong type,
>>>>>
>>>>> Ok, I'll review that code one more time.
>>>>>
>>>>>> you also do not transform code like
>>>>>>
>>>>>>   int *p = alloca(4);
>>>>>>   *p = 3;
>>>>>>
>>>>>> as there is no array type involved here.
>>>>>>
>>>>>
>>>>> I was trying to stay away from non-vla allocas.  A source-declared
>>>>> alloca has function lifetime, so we could have a single alloca in a
>>>>> loop, called 10 times, with all 10 instances live at the same time.
>>>>> This patch does not detect such cases, and thus stays away from
>>>>> non-vla allocas.  A vla decl does not have such problems; its lifetime
>>>>> ends when it goes out of scope.
>>>>
>>>> Yes indeed - that probably would require more detailed analysis.
>>>>
>>>>>>>> In fact I would simply do sth like
>>>>>>>>
>>>>>>>>   elem_type = build_nonstandard_integer_type (BITS_PER_UNIT, 1);
>>>>>>>>   n_elem = size * 8 / BITS_PER_UNIT;
>>>>>>>>   array_type = build_array_type_nelts (elem_type, n_elem);
>>>>>>>>   var = create_tmp_var (array_type, NULL);
>>>>>>>>   return fold_convert (TREE_TYPE (lhs), build_fold_addr_expr (var));
>>>>>>>>
>>>>>>>
>>>>>>> I tried this code on the example, and it works, but the newly declared
>>>>>>> type has an 8-bit alignment, while the vla base type has a 32-bit
>>>>>>> alignment.  This makes the memory access in the example potentially
>>>>>>> unaligned, which prohibits an ivopts optimization, so the resulting
>>>>>>> text size is 68 instead of the 64 achieved with my current patch.
>>>>>>
>>>>>> Ok, so then set DECL_ALIGN of the variable to something reasonable
>>>>>> like MIN (size * 8, GET_MODE_PRECISION (word_mode)).  Basically the
>>>>>> alignment that the target's alloca function would guarantee.
>>>>>>
>>>>>
>>>>> I tried that, but that doesn't help.  It's the alignment of the type
>>>>> that matters, not of the decl.
>>>>
>>>> It shouldn't.  All accesses are performed with the original types and
>>>> alignment comes from that (plus the underlying decl).
>>>>
>>>
>>> I managed to get it all working by using build_aligned_type rather than
>>> DECL_ALIGN.
>>
>> That's really odd, DECL_ALIGN should just work - nothing refers to the
>> type of the decl in the IL.  Can you try also setting DECL_USER_ALIGN to
>> 1 maybe?
>>
>
> This doesn't work either.
>
>   /* Declare array.  */
>   elem_type = build_nonstandard_integer_type (BITS_PER_UNIT, 1);
>   n_elem = size * 8 / BITS_PER_UNIT;
>   align = MIN (size * 8, GET_MODE_PRECISION (word_mode));
>   array_type = build_array_type_nelts (elem_type, n_elem);
>   var = create_tmp_var (array_type, NULL);
>   DECL_ALIGN (var) = align;
>   DECL_USER_ALIGN (var) = 1;
>
> Maybe this clarifies it:
>
> Breakpoint 1, may_be_unaligned_p (ref=0xf7d9d410, step=0xf7d3d578) at
>     /home/vries/local/google/src/gcc-mainline/gcc/tree-ssa-loop-ivopts.c:1621
> (gdb) call debug_generic_expr (ref)
> MEM[(int[0:D.2579] *)&D.2595][0]
> (gdb) call debug_generic_expr (step)
> 4
>
> 1627      base = get_inner_reference (ref, &bitsize, &bitpos, &toffset, &mode,
> (gdb) call debug_generic_expr (base)
> D.2595
>
> 1629      base_type = TREE_TYPE (base);
> (gdb) call debug_generic_expr (base_type)
> <unnamed-unsigned:8>[40]
>
> 1630      base_align = TYPE_ALIGN (base_type);
> (gdb) p base_align
> $1 = 8
>
> So the align is 8 bits, and we return true here:
>
> (gdb) n
> 1632      if (mode != BLKmode)
> (gdb) n
> 1634          unsigned mode_align = GET_MODE_ALIGNMENT (mode);
> (gdb)
> 1636          if (base_align < mode_align
> (gdb)
> 1639        return true;
>
> Here we can see that the base actually has the (user) align on it:
>
> (gdb) call debug_tree (base)
>  <var_decl 0xf7e1b420 D.2595
>     type <array_type 0xf7e1b360
>         type <integer_type 0xf7e1b2a0 public unsigned QI
>             size <integer_cst 0xf7d3d604 constant 8>
>             unit size <integer_cst 0xf7d3d620 constant 1>
>             align 8 symtab 0 alias set -1 canonical type 0xf7e1b2a0
>             precision 8
>             min <integer_cst 0xf7dffaf0 0> max <integer_cst 0xf7dffb0c 255>
>             pointer_to_this <pointer_type 0xf7e1b3c0>>
>         BLK
>         size <integer_cst 0xf7d5d070 constant 320>
>         unit size <integer_cst 0xf7dde2a0 constant 40>
>         align 8 symtab 0 alias set -1 canonical type 0xf7e1b360
>         domain <integer_type 0xf7de12a0
>             type <integer_type 0xf7d51000 unsigned int>
>             SI
>             size <integer_cst 0xf7d3d78c constant 32>
>             unit size <integer_cst 0xf7d3d578 constant 4>
>             align 32 symtab 0 alias set -1 canonical type 0xf7de12a0
>             precision 32 min <integer_cst 0xf7d3d594 0>
>             max <integer_cst 0xf7dde284 39>>
>         pointer_to_this <pointer_type 0xf7e1b480>>
>     addressable used ignored BLK file pr43513.c line 4 col 6
>     size <integer_cst 0xf7d5d070 320> unit size <integer_cst 0xf7dde2a0 40>
>     user align 32 context <function_decl 0xf7dfd480 foo3>>
>
>>>
>>>>> So should we try to find the base type of the vla, and use that, or
>>>>> use the nonstandard char type?
>>>>
>>>> I don't think we can reliably find the base type of the vla - well,
>>>> in practice we may because we control how we lower VLAs during
>>>> gimplification, but nothing in the IL constraints say that the
>>>> resulting pointer type should be special.
>>>>
>>>> Using a char[] decl shouldn't be a problem IMHO.
>>>>
>>>>>>>> And obviously you lose the optimization we arrange with inserting
>>>>>>>> __builtin_stack_save/restore pairs that way - stack space will no
>>>>>>>> longer be shared for subsequent VLAs.  Which means that you'd
>>>>>>>> better limit the size you allow this promotion.
>>>>>>>>
>>>>>>>
>>>>>>> Right, I could introduce a parameter for this.
>>>>>>
>>>>>> I would think you could use PARAM_LARGE_STACK_FRAME for now and say,
>>>>>> allow a size of PARAM_LARGE_STACK_FRAME / 10?
>>>>>>
>>>>>
>>>>> That unfortunately is too small for the example from the bug report.
>>>>> The default value of the param is 250, so that would be a threshold
>>>>> of 25, and the alloca size of the example is 40.  Perhaps we can try
>>>>> a threshold of PARAM_LARGE_STACK_FRAME - estimated_stack_size or some
>>>>> such?
>>>>
>>>> Hm.  estimated_stack_size is not O(1), so no.  I think we need to
>>>> find a sensible way of allowing stack sharing.  Eventually Micha's
>>>> patch for introducing points-of-death would help here, if we'd
>>>> go for folding this during stack-save/restore optimization.
>>>>
>>>
>>> I changed the heuristics to this:
>>>
>>> +  /* Heuristic: don't fold large vlas.  */
>>> +  threshold = (unsigned HOST_WIDE_INT)PARAM_VALUE (PARAM_LARGE_STACK_FRAME);
>>> +  /* In case a vla is declared at function scope, it has the same lifetime
>>> +     as a declared array, so we allow a larger size.  */
>>> +  block = gimple_block (stmt);
>>> +  if (!(cfun->after_inlining
>>> +        && TREE_CODE (BLOCK_SUPERCONTEXT (block)) == FUNCTION_DECL))
>>> +    threshold /= 10;
>>> +  if (size > threshold)
>>> +    return NULL_TREE;
>>>
>>> The heuristic distinguishes between before and after inlining.
>>>
>>> After inlining, vlas declared at function scope have the same lifetimes
>>> as declared arrays, and don't share their space.  There should be no
>>> negative effects from folding an alloca in this case, but for safety we
>>> set a threshold of PARAM_LARGE_STACK_FRAME.
>>>
>>> Before inlining, such a vla might be inlined and share its space with
>>> another vla, so we stick with the normal threshold before inlining.
>>
>> That sounds reasonable, though the block check should probably use the
>> original VLA decl block, not that of the basic-block of the allocation,
>> but unfortunately we don't have access to that.  So I suppose using
>> the allocation basic-block BLOCK is good enough (still, we don't
>> really care about BLOCK boundaries when doing CFG manipulations, so
>> the allocation bb's block may not be the outermost scope in more cases
>> than necessary).
>>
>>> However, using this heuristic we still don't generate optimal code.
>>>
>>> During the first pass_ccp, the folding is not done, because the size
>>> (40) is larger than the threshold of 25.  The threshold is 25 because
>>> inlining is not yet done.
>>>
>>> During pass_fold_builtins, the folding is done because it's after
>>> inlining, but it's later than pass_iv_optimize, so that still doesn't
>>> yield the optimal size of 64.
>>>
>>> The folding is not done during any of the other invocations of
>>> pass_ccp, because the argument has already become constant in the
>>> earlier invocation.
>>
>> Yeah, that's the issue with relying on folding to do this transformation.
>>
>>> Using this change, I manage to trigger folding during the second
>>> invocation of pass_ccp, before iv_optimize, so we generate optimal code.
>>>
>>> Index: gcc/tree-ssa-ccp.c
>>> ===================================================================
>>> --- gcc/tree-ssa-ccp.c  (revision 173734)
>>> +++ gcc/tree-ssa-ccp.c  (working copy)
>>> @@ -1727,6 +1727,13 @@ ccp_fold_stmt (gimple_stmt_iterator *gsi
>>>        if (gimple_call_internal_p (stmt))
>>>          return false;
>>>
>>> +      /* The heuristic of fold_builtin_alloca differs before and after
>>> +         inlining, so we don't require the arg to be changed into a
>>> +         constant for folding, but just to be constant.  */
>>> +      if (gimple_call_alloca_for_var_p (stmt)
>>> +          && get_constant_value (gimple_call_arg (stmt, 0)))
>>
>> Probably reverse the get_constant_value check and the transformation
>
> Done.
>
>> (gimple_call_alloca_for_var_p isn't a predicate as it has side-effects,
>> so its name should be changed).
>>
>>> +        return true;
>>> +
>>>        /* Propagate into the call arguments.  Compared to replace_uses_in
>>>           this can use the argument slot types for type verification
>>>           instead of the current argument type.  We also can safely
>>>
>>> But, to me it feels like a hack.  Do you have any ideas how to do this
>>> better?
>>
>> It's somewhat of a hack, but at least it is more of a defined place
>> for this transformation - which then suggests to remove it from
>> generic folding and only keep calling it from CCP this way.
>>
>
> Done.
>
This is an updated version of the patch.

I have 2 new patches and an updated testcase, which I will send out
individually.

The patch set was bootstrapped and reg-tested on x86_64.

OK for trunk?

Thanks,
- Tom

2011-07-30  Tom de Vries  <t...@codesourcery.com>

	PR middle-end/43513
	* Makefile.in (tree-ssa-ccp.o): Add $(PARAMS_H) to rule.
	* tree-ssa-ccp.c (params.h): Include.
	(fold_builtin_alloca_for_var): New function.
	(ccp_fold_stmt): Use fold_builtin_alloca_for_var.
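For reference, a minimal example of the kind of code the fold targets - a
sketch reconstructed from the debug session quoted above, not necessarily
the exact testcase from the PR (bar is a hypothetical external consumer):

  extern void bar (int *);

  void
  foo3 (void)
  {
    const int n = 10;
    int i, a[n];        /* Formally a vla, so it is lowered to an
                           alloca-for-var call; n propagates, making the
                           argument the constant 40.  */
    for (i = 0; i < n; i++)
      a[i] = i;         /* ivopts relies on 32-bit alignment of 'a'.  */
    bar (a);
  }

With the default PARAM_LARGE_STACK_FRAME of 250, the 40-byte allocation
exceeds the pre-inlining threshold of 25 but stays under the function-scope
threshold of 250, so it is folded during the second invocation of pass_ccp,
after inlining and before iv_optimize.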
Index: gcc/tree-ssa-ccp.c
===================================================================
--- gcc/tree-ssa-ccp.c  (revision 173734)
+++ gcc/tree-ssa-ccp.c  (working copy)
@@ -133,6 +133,7 @@ along with GCC; see the file COPYING3.
 #include "diagnostic-core.h"
 #include "dbgcnt.h"
 #include "gimple-fold.h"
+#include "params.h"


 /* Possible lattice values.  */
@@ -1659,6 +1660,51 @@ evaluate_stmt (gimple stmt)
   return val;
 }

+/* Detect a vla-related alloca with a constant argument.  Declare a fixed-size
+   array and return its address if found; otherwise return NULL_TREE.  */
+
+static tree
+fold_builtin_alloca_for_var (gimple stmt)
+{
+  unsigned HOST_WIDE_INT size, threshold, n_elem;
+  tree lhs, arg, block, var, elem_type, array_type;
+  unsigned int align;
+
+  /* Get lhs.  */
+  lhs = gimple_call_lhs (stmt);
+  if (lhs == NULL_TREE)
+    return NULL_TREE;
+
+  /* Detect constant argument.  */
+  arg = get_constant_value (gimple_call_arg (stmt, 0));
+  if (arg == NULL_TREE || TREE_CODE (arg) != INTEGER_CST
+      || !host_integerp (arg, 1))
+    return NULL_TREE;
+  size = TREE_INT_CST_LOW (arg);
+
+  /* Heuristic: don't fold large vlas.  */
+  threshold = (unsigned HOST_WIDE_INT)PARAM_VALUE (PARAM_LARGE_STACK_FRAME);
+  /* In case a vla is declared at function scope, it has the same lifetime
+     as a declared array, so we allow a larger size.  */
+  block = gimple_block (stmt);
+  if (!(cfun->after_inlining
+        && TREE_CODE (BLOCK_SUPERCONTEXT (block)) == FUNCTION_DECL))
+    threshold /= 10;
+  if (size > threshold)
+    return NULL_TREE;
+
+  /* Declare array.  */
+  elem_type = build_nonstandard_integer_type (BITS_PER_UNIT, 1);
+  n_elem = size * 8 / BITS_PER_UNIT;
+  align = MIN (size * 8, GET_MODE_PRECISION (word_mode));
+  array_type = build_array_type_nelts (elem_type, n_elem);
+  var = create_tmp_var (array_type, NULL);
+  DECL_ALIGN (var) = align;
+
+  /* Fold alloca to the address of the array.  */
+  return fold_convert (TREE_TYPE (lhs), build_fold_addr_expr (var));
+}
+
 /* Fold the stmt at *GSI with CCP specific information that propagating
    and regular folding does not catch.  */

@@ -1727,6 +1773,20 @@ ccp_fold_stmt (gimple_stmt_iterator *gsi
       if (gimple_call_internal_p (stmt))
         return false;

+      /* The heuristic of fold_builtin_alloca_for_var differs before and
+         after inlining, so we don't require the arg to be changed into a
+         constant for folding, but just to be constant.  */
+      if (gimple_call_alloca_for_var_p (stmt))
+        {
+          tree new_rhs = fold_builtin_alloca_for_var (stmt);
+          bool res;
+          if (new_rhs == NULL_TREE)
+            return false;
+          res = update_call_from_tree (gsi, new_rhs);
+          gcc_assert (res);
+          return true;
+        }
+
       /* Propagate into the call arguments.  Compared to replace_uses_in
          this can use the argument slot types for type verification
          instead of the current argument type.  We also can safely
Index: gcc/Makefile.in
===================================================================
--- gcc/Makefile.in  (revision 173734)
+++ gcc/Makefile.in  (working copy)
@@ -3157,7 +3157,7 @@ tree-call-cdce.o : tree-call-cdce.c $(CO
 tree-ssa-ccp.o : tree-ssa-ccp.c $(TREE_FLOW_H) $(CONFIG_H) \
    $(SYSTEM_H) $(TREE_H) $(TM_P_H) $(EXPR_H) output.h \
    $(DIAGNOSTIC_H) $(FUNCTION_H) $(TIMEVAR_H) $(TM_H) coretypes.h \
-   $(TREE_DUMP_H) $(BASIC_BLOCK_H) $(TREE_PASS_H) langhooks.h \
+   $(TREE_DUMP_H) $(BASIC_BLOCK_H) $(TREE_PASS_H) langhooks.h $(PARAMS_H) \
    tree-ssa-propagate.h value-prof.h $(FLAGS_H) $(TARGET_H) $(DIAGNOSTIC_CORE_H) \
    $(DBGCNT_H) tree-pretty-print.h gimple-pretty-print.h gimple-fold.h
 tree-sra.o : tree-sra.c $(CONFIG_H) $(SYSTEM_H) coretypes.h alloc-pool.h \
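(As an aside, a sketch of the non-vla case the fold deliberately leaves
alone, per the discussion above; 'use' is a hypothetical consumer.  A
source-level alloca has function lifetime, so every allocation made in the
loop below stays live until the function returns, and no single fixed-size
array could replace the call:

  extern void use (char **);

  void
  f (void)
  {
    char *p[10];
    int i;
    for (i = 0; i < 10; i++)
      p[i] = __builtin_alloca (4);  /* All 10 blocks live at the same time.  */
    use (p);
  }

A vla's storage, by contrast, is released when its block is left, which is
why ccp_fold_stmt only considers alloca-for-var calls.)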