On 07/28/2011 06:25 PM, Richard Guenther wrote:
> On Thu, 28 Jul 2011, Tom de Vries wrote:
>
>> On 07/28/2011 12:22 PM, Richard Guenther wrote:
>>> On Wed, 27 Jul 2011, Tom de Vries wrote:
>>>
>>>> On 07/27/2011 05:27 PM, Richard Guenther wrote:
>>>>> On Wed, 27 Jul 2011, Tom de Vries wrote:
>>>>>
>>>>>> On 07/27/2011 02:12 PM, Richard Guenther wrote:
>>>>>>> On Wed, 27 Jul 2011, Tom de Vries wrote:
>>>>>>>
>>>>>>>> On 07/27/2011 01:50 PM, Tom de Vries wrote:
>>>>>>>>> Hi Richard,
>>>>>>>>>
>>>>>>>>> I have a patch set for bug 43513 - The stack pointer is adjusted
>>>>>>>>> twice.
>>>>>>>>>
>>>>>>>>> 01_pr43513.3.patch
>>>>>>>>> 02_pr43513.3.test.patch
>>>>>>>>> 03_pr43513.3.mudflap.patch
>>>>>>>>>
>>>>>>>>> The patch set has been bootstrapped and reg-tested on x86_64.
>>>>>>>>>
>>>>>>>>> I will send out the patches individually.
>>>>>>>>>
>>>>>>>>
>>>>>>>> The patch replaces a vla __builtin_alloca that has a constant
>>>>>>>> argument with an array declaration.
>>>>>>>>
>>>>>>>> OK for trunk?
>>>>>>>
>>>>>>> I don't think it is safe to try to get at the VLA type the way you do.
>>>>>>
>>>>>> I don't understand in what way it's not safe. Do you mean I don't
>>>>>> manage to find the type always, or that I find the wrong type, or
>>>>>> something else?
>>>>>
>>>>> I think you might get the wrong type,
>>>>
>>>> Ok, I'll review that code one more time.
>>>>
>>>>> you also do not transform code like
>>>>>
>>>>>   int *p = alloca(4);
>>>>>   *p = 3;
>>>>>
>>>>> as there is no array type involved here.
>>>>>
>>>>
>>>> I was trying to stay away from non-vla allocas. A source-declared
>>>> alloca has function lifetime, so we could have a single alloca in a
>>>> loop, called 10 times, with all 10 instances live at the same time.
>>>> This patch does not detect such cases, and thus stays away from
>>>> non-vla allocas. A vla decl does not have such problems; its lifetime
>>>> ends when it goes out of scope.
>>>
>>> Yes indeed - that probably would require more detailed analysis.
>>>
>>>>>>> In fact I would simply do sth like
>>>>>>>
>>>>>>>   elem_type = build_nonstandard_integer_type (BITS_PER_UNIT, 1);
>>>>>>>   n_elem = size * 8 / BITS_PER_UNIT;
>>>>>>>   array_type = build_array_type_nelts (elem_type, n_elem);
>>>>>>>   var = create_tmp_var (array_type, NULL);
>>>>>>>   return fold_convert (TREE_TYPE (lhs), build_fold_addr_expr (var));
>>>>>>>
>>>>>>
>>>>>> I tried this code on the example, and it works, but the newly
>>>>>> declared type has an 8-bit alignment, while the vla base type has a
>>>>>> 32-bit alignment. This makes the memory access in the example
>>>>>> potentially unaligned, which prohibits an ivopts optimization, so
>>>>>> the resulting text size is 68 instead of the 64 achieved with my
>>>>>> current patch.
>>>>>
>>>>> Ok, so then set DECL_ALIGN of the variable to something reasonable
>>>>> like MIN (size * 8, GET_MODE_PRECISION (word_mode)). Basically the
>>>>> alignment that the target's alloca function would guarantee.
>>>>>
>>>>
>>>> I tried that, but that doesn't help. It's the alignment of the type
>>>> that matters, not of the decl.
>>>
>>> It shouldn't. All accesses are performed with the original types, and
>>> alignment comes from that (plus the underlying decl).
>>>
>>
>> I managed to get it all working by using build_aligned_type rather than
>> DECL_ALIGN.
>
> That's really odd, DECL_ALIGN should just work - nothing refers to the
> type of the decl in the IL. Can you try also setting DECL_USER_ALIGN to
> 1 maybe?
>
This doesn't work either.

  /* Declare array.  */
  elem_type = build_nonstandard_integer_type (BITS_PER_UNIT, 1);
  n_elem = size * 8 / BITS_PER_UNIT;
  align = MIN (size * 8, GET_MODE_PRECISION (word_mode));
  array_type = build_array_type_nelts (elem_type, n_elem);
  var = create_tmp_var (array_type, NULL);
  DECL_ALIGN (var) = align;
  DECL_USER_ALIGN (var) = 1;

Maybe this clarifies it:

Breakpoint 1, may_be_unaligned_p (ref=0xf7d9d410, step=0xf7d3d578)
    at /home/vries/local/google/src/gcc-mainline/gcc/tree-ssa-loop-ivopts.c:1621
(gdb) call debug_generic_expr (ref)
MEM[(int[0:D.2579] *)&D.2595][0]
(gdb) call debug_generic_expr (step)
4
1627      base = get_inner_reference (ref, &bitsize, &bitpos, &toffset, &mode,
(gdb) call debug_generic_expr (base)
D.2595
1629      base_type = TREE_TYPE (base);
(gdb) call debug_generic_expr (base_type)
<unnamed-unsigned:8>[40]
1630      base_align = TYPE_ALIGN (base_type);
(gdb) p base_align
$1 = 8

So the align is 8 bits, and we return true here:

(gdb) n
1632      if (mode != BLKmode)
(gdb) n
1634          unsigned mode_align = GET_MODE_ALIGNMENT (mode);
(gdb)
1636          if (base_align < mode_align
(gdb)
1639            return true;

Here we can see that the base actually has the (user) align on it:

(gdb) call debug_tree (base)
 <var_decl 0xf7e1b420 D.2595
    type <array_type 0xf7e1b360
        type <integer_type 0xf7e1b2a0 public unsigned QI
            size <integer_cst 0xf7d3d604 constant 8>
            unit size <integer_cst 0xf7d3d620 constant 1>
            align 8 symtab 0 alias set -1 canonical type 0xf7e1b2a0
            precision 8 min <integer_cst 0xf7dffaf0 0>
            max <integer_cst 0xf7dffb0c 255>
            pointer_to_this <pointer_type 0xf7e1b3c0>>
        BLK
        size <integer_cst 0xf7d5d070 constant 320>
        unit size <integer_cst 0xf7dde2a0 constant 40>
        align 8 symtab 0 alias set -1 canonical type 0xf7e1b360
        domain <integer_type 0xf7de12a0
            type <integer_type 0xf7d51000 unsigned int>
            SI size <integer_cst 0xf7d3d78c constant 32>
            unit size <integer_cst 0xf7d3d578 constant 4>
            align 32 symtab 0 alias set -1 canonical type 0xf7de12a0
            precision 32 min <integer_cst
0xf7d3d594 0> max <integer_cst 0xf7dde284 39>>
        pointer_to_this <pointer_type 0xf7e1b480>>
    addressable used ignored BLK file pr43513.c line 4 col 6
    size <integer_cst 0xf7d5d070 320> unit size <integer_cst 0xf7dde2a0 40>
    user align 32 context <function_decl 0xf7dfd480 foo3>>

>>
>>>> So should we try to find the base type of the vla, and use that, or
>>>> use the nonstandard char type?
>>>
>>> I don't think we can reliably find the base type of the vla - well,
>>> in practice we may because we control how we lower VLAs during
>>> gimplification, but nothing in the IL constraints says that the
>>> resulting pointer type should be special.
>>>
>>> Using a char[] decl shouldn't be a problem IMHO.
>>>
>>>>>>> And obviously you lose the optimization we arrange with inserting
>>>>>>> __builtin_stack_save/restore pairs that way - stack space will no
>>>>>>> longer be shared for subsequent VLAs. Which means that you'd
>>>>>>> better limit the size you allow this promotion.
>>>>>>>
>>>>>>
>>>>>> Right, I could introduce a parameter for this.
>>>>>
>>>>> I would think you could use PARAM_LARGE_STACK_FRAME for now and say,
>>>>> allow a size of PARAM_LARGE_STACK_FRAME / 10?
>>>>>
>>>>
>>>> That unfortunately is too small for the example from the bug report.
>>>> The default value of the param is 250, so that would be a threshold
>>>> of 25, and the alloca size of the example is 40. Perhaps we can try a
>>>> threshold of PARAM_LARGE_STACK_FRAME - estimated_stack_size or some
>>>> such?
>>>
>>> Hm. estimated_stack_size is not O(1), so no. I think we need to
>>> find a sensible way of allowing stack sharing. Eventually Micha's
>>> patch for introducing points-of-death would help here, if we'd
>>> go for folding this during stack-save/restore optimization.
>>>
>>
>> I changed the heuristics to this:
>>
>> +  /* Heuristic: don't fold large vlas.  */
>> +  threshold = (unsigned HOST_WIDE_INT)PARAM_VALUE (PARAM_LARGE_STACK_FRAME);
>> +  /* In case a vla is declared at function scope, it has the same
>> +     lifetime as a declared array, so we allow a larger size.  */
>> +  block = gimple_block (stmt);
>> +  if (!(cfun->after_inlining
>> +        && TREE_CODE (BLOCK_SUPERCONTEXT (block)) == FUNCTION_DECL))
>> +    threshold /= 10;
>> +  if (size > threshold)
>> +    return NULL_TREE;
>>
>> The heuristic distinguishes between before and after inlining.
>>
>> After inlining, vlas declared at function scope have the same lifetimes
>> as declared arrays, and don't share their space. There should be no
>> negative effects from folding an alloca in this case, but for safety we
>> set a threshold of PARAM_LARGE_STACK_FRAME.
>>
>> Before inlining, such a vla might be inlined and share its space with
>> another vla, so we stick with the normal threshold before inlining.
>
> That sounds reasonable, though the block check should probably use the
> original VLA decl block, not that of the basic block of the allocation,
> but unfortunately we don't have access to that. So I suppose using
> the allocation basic block's BLOCK is good enough (still we don't
> really care about BLOCK boundaries when doing CFG manipulations, so
> the allocation bb's block may be not the outermost scope in more cases
> than necessary).
>
>> However, using this heuristic we still don't generate optimal code.
>>
>> During the first pass_ccp, the folding is not done, because the size
>> (40) is larger than the threshold 25. The threshold is 25, because
>> inlining is not yet done.
>>
>> During pass_fold_builtins, the folding is done because it's after
>> inlining, but it's later than pass_iv_optimize, so that still doesn't
>> yield the optimal size of 64.
>>
>> The folding is not done during any of the other invocations of
>> pass_ccp, because the argument has already become constant in the
>> earlier invocation.
>
> Yeah, that's the issue with relying on folding to do this transformation.
>
>> Using this change, I manage to trigger folding during the second
>> invocation of pass_ccp, before iv_optimize, so we generate optimal code.
>>
>> Index: gcc/tree-ssa-ccp.c
>> ===================================================================
>> --- gcc/tree-ssa-ccp.c  (revision 173734)
>> +++ gcc/tree-ssa-ccp.c  (working copy)
>> @@ -1727,6 +1727,13 @@ ccp_fold_stmt (gimple_stmt_iterator *gsi
>>    if (gimple_call_internal_p (stmt))
>>      return false;
>>
>> +  /* The heuristic of fold_builtin_alloca differs before and after
>> +     inlining, so we don't require the arg to be changed into a
>> +     constant for folding, but just to be constant.  */
>> +  if (gimple_call_alloca_for_var_p (stmt)
>> +      && get_constant_value (gimple_call_arg (stmt, 0)))
>
> Probably reverse the get_constant_value check and the transformation

Done.

> (gimple_call_alloca_for_var_p isn't a predicate as it has side-effects,
> so its name should be changed).
>
>> +    return true;
>> +
>>    /* Propagate into the call arguments. Compared to replace_uses_in
>>       this can use the argument slot types for type verification
>>       instead of the current argument type. We also can safely
>>
>> But to me it feels like a hack. Do you have any ideas how to do this
>> better?
>
> It's somewhat of a hack, but at least it is more of a defined place
> for this transformation - which then suggests to remove it from
> generic folding and only keep calling it from CCP this way.

Done.

> Richard.
>
>> Attaching untested patch for reference (will test overnight).
>>
>> Thanks,
>> - Tom
>>
>> 2011-07-28  Tom de Vries  <t...@codesourcery.com>
>>
>>	PR middle-end/43513
>>	* gimple-fold.c (params.h): Include.
>>	(fold_builtin_alloca): New function.
>>	(gimple_fold_builtin): Use fold_builtin_alloca.
>>	* tree-ssa-ccp.c (ccp_fold_stmt): Force folding of vla-related alloca.
>>	* Makefile.in (gimple-fold.o): Add $(PARAMS_H) to rule.

Thanks,
- Tom