On 07/30/2011 10:21 AM, Tom de Vries wrote:
> Hi,
>
> On 07/28/2011 08:20 PM, Tom de Vries wrote:
>> On 07/28/2011 06:25 PM, Richard Guenther wrote:
>>> On Thu, 28 Jul 2011, Tom de Vries wrote:
>>>
>>>> On 07/28/2011 12:22 PM, Richard Guenther wrote:
>>>>> On Wed, 27 Jul 2011, Tom de Vries wrote:
>>>>>
>>>>>> On 07/27/2011 05:27 PM, Richard Guenther wrote:
>>>>>>> On Wed, 27 Jul 2011, Tom de Vries wrote:
>>>>>>>
>>>>>>>> On 07/27/2011 02:12 PM, Richard Guenther wrote:
>>>>>>>>> On Wed, 27 Jul 2011, Tom de Vries wrote:
>>>>>>>>>
>>>>>>>>>> On 07/27/2011 01:50 PM, Tom de Vries wrote:
>>>>>>>>>>> Hi Richard,
>>>>>>>>>>>
>>>>>>>>>>> I have a patch set for bug 43513 - The stack pointer is adjusted
>>>>>>>>>>> twice.
>>>>>>>>>>>
>>>>>>>>>>> 01_pr43513.3.patch
>>>>>>>>>>> 02_pr43513.3.test.patch
>>>>>>>>>>> 03_pr43513.3.mudflap.patch
>>>>>>>>>>>
>>>>>>>>>>> The patch set has been bootstrapped and reg-tested on x86_64.
>>>>>>>>>>>
>>>>>>>>>>> I will send out the patches individually.
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> The patch replaces a vla __builtin_alloca that has a constant
>>>>>>>>>> argument with an array declaration.
>>>>>>>>>>
>>>>>>>>>> OK for trunk?
>>>>>>>>>
>>>>>>>>> I don't think it is safe to try to get at the VLA type the way you do.
>>>>>>>>
>>>>>>>> I don't understand in what way it's not safe. Do you mean I don't
>>>>>>>> manage to find the type always, or that I find the wrong type, or
>>>>>>>> something else?
>>>>>>>
>>>>>>> I think you might get the wrong type,
>>>>>>
>>>>>> Ok, I'll review that code one more time.
>>>>>>
>>>>>>> you also do not transform code like
>>>>>>>
>>>>>>>   int *p = alloca(4);
>>>>>>>   *p = 3;
>>>>>>>
>>>>>>> as there is no array type involved here.
>>>>>>>
>>>>>>
>>>>>> I was trying to stay away from non-vla allocas. A source-declared
>>>>>> alloca has function lifetime, so we could have a single alloca in a
>>>>>> loop, called 10 times, with all 10 instances live at the same time.
>>>>>> This patch does not detect such cases, and thus stays away from
>>>>>> non-vla allocas. A vla decl does not have such problems: its lifetime
>>>>>> ends when it goes out of scope.
>>>>>
>>>>> Yes indeed - that probably would require more detailed analysis.
>>>>>
>>>>>>>>> In fact I would simply do sth like
>>>>>>>>>
>>>>>>>>>   elem_type = build_nonstandard_integer_type (BITS_PER_UNIT, 1);
>>>>>>>>>   n_elem = size * 8 / BITS_PER_UNIT;
>>>>>>>>>   array_type = build_array_type_nelts (elem_type, n_elem);
>>>>>>>>>   var = create_tmp_var (array_type, NULL);
>>>>>>>>>   return fold_convert (TREE_TYPE (lhs), build_fold_addr_expr (var));
>>>>>>>>>
>>>>>>>>
>>>>>>>> I tried this code on the example, and it works, but the newly declared
>>>>>>>> type has an 8-bit alignment, while the vla base type has a 32-bit
>>>>>>>> alignment. This makes the memory access in the example potentially
>>>>>>>> unaligned, which prohibits an ivopts optimization, so the resulting
>>>>>>>> text size is 68 instead of the 64 achieved with my current patch.
>>>>>>>
>>>>>>> Ok, so then set DECL_ALIGN of the variable to something reasonable
>>>>>>> like MIN (size * 8, GET_MODE_PRECISION (word_mode)). Basically the
>>>>>>> alignment that the target's alloca function would guarantee.
>>>>>>>
>>>>>>
>>>>>> I tried that, but that doesn't help. It's the alignment of the type
>>>>>> that matters, not of the decl.
>>>>>
>>>>> It shouldn't. All accesses are performed with the original types, and
>>>>> alignment comes from that (plus the underlying decl).
>>>>>
>>>>
>>>> I managed to get it all working by using build_aligned_type rather than
>>>> DECL_ALIGN.
>>>
>>> That's really odd, DECL_ALIGN should just work - nothing refers to the
>>> type of the decl in the IL. Can you try also setting DECL_USER_ALIGN to
>>> 1 maybe?
>>>
>>
>> This doesn't work either.
>>
>>   /* Declare array.  */
>>   elem_type = build_nonstandard_integer_type (BITS_PER_UNIT, 1);
>>   n_elem = size * 8 / BITS_PER_UNIT;
>>   align = MIN (size * 8, GET_MODE_PRECISION (word_mode));
>>   array_type = build_array_type_nelts (elem_type, n_elem);
>>   var = create_tmp_var (array_type, NULL);
>>   DECL_ALIGN (var) = align;
>>   DECL_USER_ALIGN (var) = 1;
>>
>> Maybe this clarifies it:
>>
>> Breakpoint 1, may_be_unaligned_p (ref=0xf7d9d410, step=0xf7d3d578) at
>> /home/vries/local/google/src/gcc-mainline/gcc/tree-ssa-loop-ivopts.c:1621
>> (gdb) call debug_generic_expr (ref)
>> MEM[(int[0:D.2579] *)&D.2595][0]
>> (gdb) call debug_generic_expr (step)
>> 4
>>
>> 1627   base = get_inner_reference (ref, &bitsize, &bitpos, &toffset, &mode,
>> (gdb) call debug_generic_expr (base)
>> D.2595
>>
>> 1629   base_type = TREE_TYPE (base);
>> (gdb) call debug_generic_expr (base_type)
>> <unnamed-unsigned:8>[40]
>>
>> 1630   base_align = TYPE_ALIGN (base_type);
>> (gdb) p base_align
>> $1 = 8
>>
>> So the align is 8 bits, and we return true here:
>>
>> (gdb) n
>> 1632   if (mode != BLKmode)
>> (gdb) n
>> 1634       unsigned mode_align = GET_MODE_ALIGNMENT (mode);
>> (gdb)
>> 1636       if (base_align < mode_align
>> (gdb)
>> 1639         return true;
>>
>> Here we can see that the base actually has the (user) align on it:
>>
>> (gdb) call debug_tree (base)
>>  <var_decl 0xf7e1b420 D.2595
>>     type <array_type 0xf7e1b360
>>         type <integer_type 0xf7e1b2a0 public unsigned QI
>>             size <integer_cst 0xf7d3d604 constant 8>
>>             unit size <integer_cst 0xf7d3d620 constant 1>
>>             align 8 symtab 0 alias set -1 canonical type 0xf7e1b2a0
>>             precision 8
>>             min <integer_cst 0xf7dffaf0 0> max <integer_cst 0xf7dffb0c 255>
>>             pointer_to_this <pointer_type 0xf7e1b3c0>>
>>         BLK
>>         size <integer_cst 0xf7d5d070 constant 320>
>>         unit size <integer_cst 0xf7dde2a0 constant 40>
>>         align 8 symtab 0 alias set -1 canonical type 0xf7e1b360
>>         domain <integer_type 0xf7de12a0
>>             type <integer_type 0xf7d51000 unsigned int>
>>             SI
>>             size <integer_cst 0xf7d3d78c constant 32>
>>             unit size <integer_cst 0xf7d3d578 constant 4>
>>             align 32 symtab 0 alias set -1 canonical type 0xf7de12a0
>>             precision 32 min <integer_cst 0xf7d3d594 0>
>>             max <integer_cst 0xf7dde284 39>>
>>         pointer_to_this <pointer_type 0xf7e1b480>>
>>     addressable used ignored BLK file pr43513.c line 4 col 6
>>     size <integer_cst 0xf7d5d070 320> unit size <integer_cst 0xf7dde2a0 40>
>>     user align 32 context <function_decl 0xf7dfd480 foo3>>
>>
>>>>
>>>>>> So should we try to find the base type of the vla, and use that, or
>>>>>> use the nonstandard char type?
>>>>>
>>>>> I don't think we can reliably find the base type of the vla - well,
>>>>> in practice we may because we control how we lower VLAs during
>>>>> gimplification, but nothing in the IL says that the resulting
>>>>> pointer type should be special.
>>>>>
>>>>> Using a char[] decl shouldn't be a problem IMHO.
>>>>>
>>>>>>>>> And obviously you lose the optimization we arrange with inserting
>>>>>>>>> __builtin_stack_save/restore pairs that way - stack space will no
>>>>>>>>> longer be shared for subsequent VLAs. Which means that you'd
>>>>>>>>> better limit the size you allow this promotion.
>>>>>>>>>
>>>>>>>>
>>>>>>>> Right, I could introduce a parameter for this.
>>>>>>>
>>>>>>> I would think you could use PARAM_LARGE_STACK_FRAME for now and say,
>>>>>>> allow a size of PARAM_LARGE_STACK_FRAME / 10?
>>>>>>>
>>>>>>
>>>>>> That unfortunately is too small for the example from the bug report.
>>>>>> The default value of the param is 250, so that would be a threshold of
>>>>>> 25, and the alloca size of the example is 40. Perhaps we can try a
>>>>>> threshold of PARAM_LARGE_STACK_FRAME - estimated_stack_size or some
>>>>>> such?
>>>>>
>>>>> Hm. estimated_stack_size is not O(1), so no. I think we need to
>>>>> find a sensible way of allowing stack sharing.
>>>>> Eventually Micha's patch for introducing points-of-death would help
>>>>> here, if we'd go for folding this during stack-save/restore
>>>>> optimization.
>>>>>
>>>>
>>>> I changed the heuristics to this:
>>>>
>>>> +  /* Heuristic: don't fold large vlas.  */
>>>> +  threshold = (unsigned HOST_WIDE_INT) PARAM_VALUE (PARAM_LARGE_STACK_FRAME);
>>>> +  /* In case a vla is declared at function scope, it has the same
>>>> +     lifetime as a declared array, so we allow a larger size.  */
>>>> +  block = gimple_block (stmt);
>>>> +  if (!(cfun->after_inlining
>>>> +        && TREE_CODE (BLOCK_SUPERCONTEXT (block)) == FUNCTION_DECL))
>>>> +    threshold /= 10;
>>>> +  if (size > threshold)
>>>> +    return NULL_TREE;
>>>>
>>>> The heuristic distinguishes between before and after inlining.
>>>>
>>>> After inlining, vlas declared at function scope have the same lifetimes
>>>> as declared arrays, and don't share their space. There should be no
>>>> negative effects from folding an alloca in this case, but for safety we
>>>> set a threshold of PARAM_LARGE_STACK_FRAME.
>>>>
>>>> Before inlining, such a vla might be inlined and share its space with
>>>> another vla, so we stick with the normal threshold before inlining.
>>>
>>> That sounds reasonable, though the block check should probably use the
>>> original VLA decl block, not that of the basic block of the allocation,
>>> but unfortunately we don't have access to that. So I suppose using
>>> the allocation basic block's BLOCK is good enough (still, we don't
>>> really care about BLOCK boundaries when doing CFG manipulations, so
>>> the allocation bb's block may be not the outermost scope in more cases
>>> than necessary).
>>>
>>>> However, using this heuristic we still don't generate optimal code.
>>>>
>>>> During the first pass_ccp, the folding is not done, because the size
>>>> (40) is larger than the threshold of 25. The threshold is 25, because
>>>> inlining is not yet done.
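For reference, the size heuristic discussed here can be restated as a small standalone function. This is a hypothetical sketch, not GCC code: the name `fold_vla_alloca_p` and its parameters are made up for illustration, with `large_stack_frame` standing in for PARAM_LARGE_STACK_FRAME (default 250) and the two flags standing in for `cfun->after_inlining` and the BLOCK_SUPERCONTEXT check:

```c
#include <stdbool.h>

/* Hypothetical restatement of the folding heuristic above.  Sizes are
   in bytes; large_stack_frame mirrors PARAM_LARGE_STACK_FRAME.  */
static bool
fold_vla_alloca_p (unsigned long size, bool after_inlining,
                   bool function_scope, unsigned long large_stack_frame)
{
  unsigned long threshold = large_stack_frame;

  /* Before inlining, or for a vla declared in an inner block, the vla
     may still end up sharing stack space with another vla, so only a
     tenth of the budget is allowed.  */
  if (!(after_inlining && function_scope))
    threshold /= 10;

  /* The original code returns NULL_TREE (no folding) when
     size > threshold.  */
  return size <= threshold;
}
```

With the default parameter value, the 40-byte vla from the bug report passes the check only after inlining (40 <= 250 but 40 > 25), which is exactly why the first pass_ccp invocation cannot fold it.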
>>>>
>>>> During pass_fold_builtins, the folding is done because it's after
>>>> inlining, but it's later than pass_iv_optimize, so that still doesn't
>>>> yield the optimal size of 64.
>>>>
>>>> The folding is not done during any of the other invocations of
>>>> pass_ccp, because the argument has already become constant in the
>>>> earlier invocation.
>>>
>>> Yeah, that's the issue with relying on folding to do this transformation.
>>>
>>>> Using this change, I manage to trigger folding during the second
>>>> invocation of pass_ccp, before iv_optimize, so we generate optimal code.
>>>>
>>>> Index: gcc/tree-ssa-ccp.c
>>>> ===================================================================
>>>> --- gcc/tree-ssa-ccp.c (revision 173734)
>>>> +++ gcc/tree-ssa-ccp.c (working copy)
>>>> @@ -1727,6 +1727,13 @@ ccp_fold_stmt (gimple_stmt_iterator *gsi
>>>>    if (gimple_call_internal_p (stmt))
>>>>      return false;
>>>>
>>>> +  /* The heuristic of fold_builtin_alloca differs before and after
>>>> +     inlining, so we don't require the arg to be changed into a
>>>> +     constant for folding, but just to be constant.  */
>>>> +  if (gimple_call_alloca_for_var_p (stmt)
>>>> +      && get_constant_value (gimple_call_arg (stmt, 0)))
>>>
>>> Probably reverse the get_constant_value check and the transformation
>>
>> Done.
>>
>>> (gimple_call_alloca_for_var_p isn't a predicate as it has side-effects,
>>> so its name should be changed).
>>>
>>>> +    return true;
>>>> +
>>>>    /* Propagate into the call arguments. Compared to replace_uses_in
>>>>       this can use the argument slot types for type verification
>>>>       instead of the current argument type. We also can safely
>>>>
>>>> But, to me it feels like a hack. Do you have any ideas how to do this
>>>> better?
>>>
>>> It's somewhat of a hack, but at least it is more of a defined place
>>> for this transformation - which then suggests removing it from generic
>>> folding and only keeping calls to it from CCP this way.
>>>
>>
>> Done.
>>
>
> This is an updated version of the patch. I have 2 new patches and an
> updated testcase which I will send out individually.
>
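At the source level, the transformation under discussion turns a vla whose size has become a compile-time constant into a plain local array, removing the run-time stack-pointer adjustment. The following sketch illustrates the before/after shapes with hypothetical function names (`sum_vla`, `sum_array` are not from the patch); both compute the same result:

```c
/* Before folding: a vla whose bound is constant after propagation.
   The gimplifier lowers the declaration to a __builtin_alloca call
   marked as alloca-for-var; its lifetime ends at scope exit.  */
static int
sum_vla (int n)
{
  int a[n];
  int s = 0;
  for (int i = 0; i < n; i++)
    a[i] = i;
  for (int i = 0; i < n; i++)
    s += a[i];
  return s;
}

/* After folding, with n known to be 10: a fixed-size local array in
   the frame, so no stack-pointer adjustment is needed at run time.  */
static int
sum_array (void)
{
  int a[10];
  int s = 0;
  for (int i = 0; i < 10; i++)
    a[i] = i;
  for (int i = 0; i < 10; i++)
    s += a[i];
  return s;
}
```

The alignment problem discussed in the thread arises because the replacement array is built from a char-like element type: without propagating the decl's alignment, accesses that were 32-bit aligned through the vla look potentially unaligned through the new array.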
2011-07-30  Tom de Vries  <t...@codesourcery.com>

	PR middle-end/43513
	* tree-ssa-loop-ivopts.c (may_be_unaligned_p): Use
	get_object_alignment.
Index: gcc/tree-ssa-loop-ivopts.c
===================================================================
--- gcc/tree-ssa-loop-ivopts.c (revision 173734)
+++ gcc/tree-ssa-loop-ivopts.c (working copy)
@@ -1608,7 +1608,6 @@ static bool
 may_be_unaligned_p (tree ref, tree step)
 {
   tree base;
-  tree base_type;
   HOST_WIDE_INT bitsize;
   HOST_WIDE_INT bitpos;
   tree toffset;
@@ -1626,8 +1625,7 @@ may_be_unaligned_p (tree ref, tree step)
      STRICT_ALIGNMENT is true.  */
   base = get_inner_reference (ref, &bitsize, &bitpos, &toffset, &mode,
			      &unsignedp, &volatilep, true);
-  base_type = TREE_TYPE (base);
-  base_align = TYPE_ALIGN (base_type);
+  base_align = get_object_alignment (base, BIGGEST_ALIGNMENT);

   if (mode != BLKmode)
     {
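The effect of this change: get_object_alignment looks at the underlying decl, so the user alignment set via DECL_ALIGN/DECL_USER_ALIGN (32 bits in the debug session above) is honored, instead of the 8-bit TYPE_ALIGN of the char array type. The core comparison can be sketched as a hypothetical standalone function (alignments in bits, names made up for illustration):

```c
#include <stdbool.h>

/* Hypothetical core of the may_be_unaligned_p check: an access may be
   unaligned when the alignment known for the object containing the
   reference is smaller than the alignment its access mode requires.  */
static bool
may_be_unaligned (unsigned int base_align, unsigned int mode_align)
{
  return base_align < mode_align;
}
```

With TYPE_ALIGN, the char[40] base reported 8, so a 32-bit SImode access looked unaligned (8 < 32) and ivopts gave up; with get_object_alignment the decl's user align of 32 makes the check pass and the optimization fire.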