On Sat, Jul 30, 2011 at 9:34 AM, Tom de Vries <vr...@codesourcery.com> wrote:
> On 07/30/2011 10:21 AM, Tom de Vries wrote:
>> Hi,
>>
>> On 07/28/2011 08:20 PM, Tom de Vries wrote:
>>> On 07/28/2011 06:25 PM, Richard Guenther wrote:
>>>> On Thu, 28 Jul 2011, Tom de Vries wrote:
>>>>
>>>>> On 07/28/2011 12:22 PM, Richard Guenther wrote:
>>>>>> On Wed, 27 Jul 2011, Tom de Vries wrote:
>>>>>>
>>>>>>> On 07/27/2011 05:27 PM, Richard Guenther wrote:
>>>>>>>> On Wed, 27 Jul 2011, Tom de Vries wrote:
>>>>>>>>
>>>>>>>>> On 07/27/2011 02:12 PM, Richard Guenther wrote:
>>>>>>>>>> On Wed, 27 Jul 2011, Tom de Vries wrote:
>>>>>>>>>>
>>>>>>>>>>> On 07/27/2011 01:50 PM, Tom de Vries wrote:
>>>>>>>>>>>> Hi Richard,
>>>>>>>>>>>>
>>>>>>>>>>>> I have a patch set for bug 43513 - The stack pointer is
>>>>>>>>>>>> adjusted twice.
>>>>>>>>>>>>
>>>>>>>>>>>> 01_pr43513.3.patch
>>>>>>>>>>>> 02_pr43513.3.test.patch
>>>>>>>>>>>> 03_pr43513.3.mudflap.patch
>>>>>>>>>>>>
>>>>>>>>>>>> The patch set has been bootstrapped and reg-tested on x86_64.
>>>>>>>>>>>>
>>>>>>>>>>>> I will send out the patches individually.
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> The patch replaces a VLA __builtin_alloca that has a constant
>>>>>>>>>>> argument with an array declaration.
>>>>>>>>>>>
>>>>>>>>>>> OK for trunk?
>>>>>>>>>>
>>>>>>>>>> I don't think it is safe to try to get at the VLA type the way
>>>>>>>>>> you do.
>>>>>>>>>
>>>>>>>>> I don't understand in what way it's not safe. Do you mean I don't
>>>>>>>>> manage to find the type always, or that I find the wrong type, or
>>>>>>>>> something else?
>>>>>>>>
>>>>>>>> I think you might get the wrong type,
>>>>>>>
>>>>>>> Ok, I'll review that code one more time.
>>>>>>>
>>>>>>>> you also do not transform code like
>>>>>>>>
>>>>>>>>   int *p = alloca(4);
>>>>>>>>   *p = 3;
>>>>>>>>
>>>>>>>> as there is no array type involved here.
>>>>>>>>
>>>>>>>
>>>>>>> I was trying to stay away from non-VLA allocas. A source-declared
>>>>>>> alloca has function lifetime, so we could have a single alloca in a
>>>>>>> loop, called 10 times, with all 10 instances live at the same time.
>>>>>>> This patch does not detect such cases, and thus stays away from
>>>>>>> non-VLA allocas. A VLA decl does not have such problems; its
>>>>>>> lifetime ends when it goes out of scope.
>>>>>>
>>>>>> Yes indeed - that probably would require more detailed analysis.
>>>>>>
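For illustration, a minimal, self-contained example of the lifetime
problem described above (hypothetical code, not taken from the thread's
testcase): every block returned by alloca stays live until the function
returns, so the ten allocations below are all live at once and could not
be replaced by a single block-scoped array.

  /* alloca is non-standard; alloca.h is the usual glibc home for it.  */
  #include <alloca.h>

  int *p[10];

  void
  f (void)
  {
    int i;
    for (i = 0; i < 10; i++)
      {
        /* Each iteration gets a distinct 4-byte block; none of them
           is released until f returns.  */
        p[i] = alloca (4);
        *p[i] = i;
      }
    /* Here p[0] .. p[9] still point to ten distinct live blocks.  */
  }

A VLA declared inside the loop body would instead go dead at the end of
each iteration, which is what makes the VLA case safe to fold.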
>>>>>>>>>> In fact I would simply do sth like
>>>>>>>>>>
>>>>>>>>>>   elem_type = build_nonstandard_integer_type (BITS_PER_UNIT, 1);
>>>>>>>>>>   n_elem = size * 8 / BITS_PER_UNIT;
>>>>>>>>>>   array_type = build_array_type_nelts (elem_type, n_elem);
>>>>>>>>>>   var = create_tmp_var (array_type, NULL);
>>>>>>>>>>   return fold_convert (TREE_TYPE (lhs), build_fold_addr_expr (var));
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> I tried this code on the example, and it works, but the newly
>>>>>>>>> declared type has an 8-bit alignment, while the VLA base type has
>>>>>>>>> a 32-bit alignment. This makes the memory access in the example
>>>>>>>>> potentially unaligned, which prohibits an ivopts optimization, so
>>>>>>>>> the resulting text size is 68 instead of the 64 achieved with my
>>>>>>>>> current patch.
>>>>>>>>
>>>>>>>> Ok, so then set DECL_ALIGN of the variable to something reasonable
>>>>>>>> like MIN (size * 8, GET_MODE_PRECISION (word_mode)). Basically the
>>>>>>>> alignment that the target's alloca function would guarantee.
>>>>>>>>
>>>>>>>
>>>>>>> I tried that, but that doesn't help. It's the alignment of the type
>>>>>>> that matters, not of the decl.
>>>>>>
>>>>>> It shouldn't. All accesses are performed with the original types and
>>>>>> alignment comes from that (plus the underlying decl).
>>>>>>
>>>>>
>>>>> I managed to get it all working by using build_aligned_type rather
>>>>> than DECL_ALIGN.
>>>>
>>>> That's really odd, DECL_ALIGN should just work - nothing refers to the
>>>> type of the decl in the IL. Can you try also setting DECL_USER_ALIGN
>>>> to 1 maybe?
>>>>
>>>
>>> This doesn't work either.
>>>
>>>   /* Declare array.  */
>>>   elem_type = build_nonstandard_integer_type (BITS_PER_UNIT, 1);
>>>   n_elem = size * 8 / BITS_PER_UNIT;
>>>   align = MIN (size * 8, GET_MODE_PRECISION (word_mode));
>>>   array_type = build_array_type_nelts (elem_type, n_elem);
>>>   var = create_tmp_var (array_type, NULL);
>>>   DECL_ALIGN (var) = align;
>>>   DECL_USER_ALIGN (var) = 1;
>>>
>>> Maybe this clarifies it:
>>>
>>> Breakpoint 1, may_be_unaligned_p (ref=0xf7d9d410, step=0xf7d3d578) at
>>> /home/vries/local/google/src/gcc-mainline/gcc/tree-ssa-loop-ivopts.c:1621
>>> (gdb) call debug_generic_expr (ref)
>>> MEM[(int[0:D.2579] *)&D.2595][0]
>>> (gdb) call debug_generic_expr (step)
>>> 4
>>>
>>> 1627      base = get_inner_reference (ref, &bitsize, &bitpos, &toffset, &mode,
>>> (gdb) call debug_generic_expr (base)
>>> D.2595
>>>
>>> 1629      base_type = TREE_TYPE (base);
>>> (gdb) call debug_generic_expr (base_type)
>>> <unnamed-unsigned:8>[40]
>>>
>>> 1630      base_align = TYPE_ALIGN (base_type);
>>> (gdb) p base_align
>>> $1 = 8
>>>
>>> So the align is 8 bits, and we return true here:
>>>
>>> (gdb) n
>>> 1632      if (mode != BLKmode)
>>> (gdb) n
>>> 1634          unsigned mode_align = GET_MODE_ALIGNMENT (mode);
>>> (gdb)
>>> 1636          if (base_align < mode_align
>>> (gdb)
>>> 1639            return true;
>>>
>>> Here we can see that the base actually has the (user) align on it:
>>>
>>> (gdb) call debug_tree (base)
>>>  <var_decl 0xf7e1b420 D.2595
>>>     type <array_type 0xf7e1b360
>>>         type <integer_type 0xf7e1b2a0 public unsigned QI
>>>             size <integer_cst 0xf7d3d604 constant 8>
>>>             unit size <integer_cst 0xf7d3d620 constant 1>
>>>             align 8 symtab 0 alias set -1 canonical type 0xf7e1b2a0
>>>             precision 8
>>>             min <integer_cst 0xf7dffaf0 0> max <integer_cst 0xf7dffb0c 255>
>>>             pointer_to_this <pointer_type 0xf7e1b3c0>>
>>>         BLK
>>>         size <integer_cst 0xf7d5d070 constant 320>
>>>         unit size <integer_cst 0xf7dde2a0 constant 40>
>>>         align 8 symtab 0 alias set -1 canonical type 0xf7e1b360
>>>         domain <integer_type 0xf7de12a0
>>>             type <integer_type 0xf7d51000 unsigned int>
>>>             SI
>>>             size <integer_cst 0xf7d3d78c constant 32>
>>>             unit size <integer_cst 0xf7d3d578 constant 4>
>>>             align 32 symtab 0 alias set -1 canonical type 0xf7de12a0
>>>             precision 32 min <integer_cst 0xf7d3d594 0>
>>>             max <integer_cst 0xf7dde284 39>>
>>>         pointer_to_this <pointer_type 0xf7e1b480>>
>>>     addressable used ignored BLK file pr43513.c line 4 col 6
>>>     size <integer_cst 0xf7d5d070 320> unit size <integer_cst 0xf7dde2a0 40>
>>>     user align 32 context <function_decl 0xf7dfd480 foo3>>
>>>
>>>>>
>>>>>>> So should we try to find the base type of the VLA and use that, or
>>>>>>> use the nonstandard char type?
>>>>>>
>>>>>> I don't think we can reliably find the base type of the VLA - well,
>>>>>> in practice we may because we control how we lower VLAs during
>>>>>> gimplification, but nothing in the IL constrains the resulting
>>>>>> pointer type to be special.
>>>>>>
>>>>>> Using a char[] decl shouldn't be a problem IMHO.
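For reference, a sketch of the variant Tom reports success with,
assembled from the snippets above (an assumed reconstruction, not the
committed patch): instead of setting DECL_ALIGN on the decl, the array
type itself carries the alignment the target's alloca would guarantee,
via build_aligned_type.

  /* Fragment in the context of the alloca-folding code above; size and
     lhs are assumed to be as in the earlier snippets.  */
  elem_type = build_nonstandard_integer_type (BITS_PER_UNIT, 1);
  n_elem = size * 8 / BITS_PER_UNIT;
  align = MIN (size * 8, GET_MODE_PRECISION (word_mode));
  array_type = build_aligned_type (build_array_type_nelts (elem_type,
                                                           n_elem),
                                   align);
  var = create_tmp_var (array_type, NULL);
  return fold_convert (TREE_TYPE (lhs), build_fold_addr_expr (var));

Since may_be_unaligned_p in tree-ssa-loop-ivopts.c reads the alignment
off the base's type (TYPE_ALIGN (base_type), as the gdb session shows),
putting the alignment on the type rather than the decl is what unblocks
the ivopts optimization here.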
>>>>>>
>>>>>>>>>> And obviously you lose the optimization we arrange with inserting
>>>>>>>>>> __builtin_stack_save/restore pairs that way - stack space will no
>>>>>>>>>> longer be shared for subsequent VLAs. Which means that you'd
>>>>>>>>>> better limit the size you allow this promotion.
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Right, I could introduce a parameter for this.
>>>>>>>>
>>>>>>>> I would think you could use PARAM_LARGE_STACK_FRAME for now and say,
>>>>>>>> allow a size of PARAM_LARGE_STACK_FRAME / 10?
>>>>>>>>
>>>>>>>
>>>>>>> That unfortunately is too small for the example from the bug report.
>>>>>>> The default value of the param is 250, so that would be a threshold
>>>>>>> of 25, and the alloca size of the example is 40. Perhaps we can try
>>>>>>> a threshold of PARAM_LARGE_STACK_FRAME - estimated_stack_size, or
>>>>>>> some such?
>>>>>>
>>>>>> Hm. estimated_stack_size is not O(1), so no. I think we need to
>>>>>> find a sensible way of allowing stack sharing. Eventually Micha's
>>>>>> patch for introducing points-of-death would help here, if we'd
>>>>>> go for folding this during stack-save/restore optimization.
>>>>>>
>>>>>
>>>>> I changed the heuristics to this:
>>>>>
>>>>> +  /* Heuristic: don't fold large vlas.  */
>>>>> +  threshold = (unsigned HOST_WIDE_INT)PARAM_VALUE (PARAM_LARGE_STACK_FRAME);
>>>>> +  /* In case a vla is declared at function scope, it has the same lifetime as a
>>>>> +     declared array, so we allow a larger size.  */
>>>>> +  block = gimple_block (stmt);
>>>>> +  if (!(cfun->after_inlining
>>>>> +        && TREE_CODE (BLOCK_SUPERCONTEXT (block)) == FUNCTION_DECL))
>>>>> +    threshold /= 10;
>>>>> +  if (size > threshold)
>>>>> +    return NULL_TREE;
>>>>>
>>>>> The heuristic distinguishes between before and after inlining.
>>>>>
>>>>> After inlining, VLAs declared at function scope have the same
>>>>> lifetimes as declared arrays, and don't share their space. There
>>>>> should be no negative effects from folding an alloca in this case,
>>>>> but for safety we set a threshold of PARAM_LARGE_STACK_FRAME.
>>>>>
>>>>> Before inlining, such a VLA might be inlined and share its space with
>>>>> another VLA, so we stick with the normal threshold before inlining.
>>>>
>>>> That sounds reasonable, though the block check should probably use the
>>>> original VLA decl block, not that of the basic block of the
>>>> allocation, but unfortunately we don't have access to that. So I
>>>> suppose using the allocation basic block's BLOCK is good enough
>>>> (still, we don't really care about BLOCK boundaries when doing CFG
>>>> manipulations, so the allocation bb's block may be not the outermost
>>>> scope in more cases than necessary).
>>>>
>>>>> However, using this heuristic we still don't generate optimal code.
>>>>>
>>>>> During the first pass_ccp, the folding is not done, because the size
>>>>> (40) is larger than the threshold of 25. The threshold is 25 because
>>>>> inlining has not yet been done.
>>>>>
>>>>> During pass_fold_builtins, the folding is done because it's after
>>>>> inlining, but that pass runs later than pass_iv_optimize, so it still
>>>>> doesn't yield the optimal size of 64.
>>>>>
>>>>> The folding is not done during any of the other invocations of
>>>>> pass_ccp, because the argument has already become constant in the
>>>>> earlier invocation.
>>>>
>>>> Yeah, that's the issue with relying on folding to do this transformation.
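For concreteness, a hypothetical reconstruction of the kind of testcase
under discussion (the real pr43513.c is not quoted in the thread; only
the 40-byte size, the foo3 name, and the int element type are visible in
the dumps above):

  void bar (int *);

  void
  foo3 (void)
  {
    int n = 10;
    int a[n];   /* a VLA: gimplified to a __builtin_alloca call whose
                   argument folds to the constant 40  */
    bar (a);
  }

With the default PARAM_LARGE_STACK_FRAME of 250, the pre-inlining
threshold of 250 / 10 = 25 rejects the 40-byte allocation, which is why
the folding can only succeed in a post-inlining pass.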
>>>>
>>>>> Using this change, I managed to trigger folding during the second
>>>>> invocation of pass_ccp, before iv_optimize, so we generate optimal
>>>>> code.
>>>>>
>>>>> Index: gcc/tree-ssa-ccp.c
>>>>> ===================================================================
>>>>> --- gcc/tree-ssa-ccp.c  (revision 173734)
>>>>> +++ gcc/tree-ssa-ccp.c  (working copy)
>>>>> @@ -1727,6 +1727,13 @@ ccp_fold_stmt (gimple_stmt_iterator *gsi
>>>>>    if (gimple_call_internal_p (stmt))
>>>>>      return false;
>>>>>
>>>>> +  /* The heuristic of fold_builtin_alloca differs before and after
>>>>> +     inlining, so we don't require the arg to be changed into a constant
>>>>> +     for folding, but just to be constant.  */
>>>>> +  if (gimple_call_alloca_for_var_p (stmt)
>>>>> +      && get_constant_value (gimple_call_arg (stmt, 0)))
>>>>
>>>> Probably reverse the get_constant_value check and the transformation
>>>
>>> Done.
>>>
>>>> (gimple_call_alloca_for_var_p isn't a predicate as it has side
>>>> effects, so its name should be changed).
>>>>
>>>>> +    return true;
>>>>> +
>>>>>    /* Propagate into the call arguments.  Compared to replace_uses_in
>>>>>       this can use the argument slot types for type verification
>>>>>       instead of the current argument type.  We also can safely
>>>>>
>>>>> But, to me it feels like a hack. Do you have any ideas how to do this
>>>>> better?
>>>>
>>>> It's somewhat of a hack, but at least it is more of a defined place
>>>> for this transformation - which then suggests removing it from
>>>> generic folding and keeping only the call from CCP this way.
>>>>
>>>
>>> Done.
>>>
>>
>> This is an updated version of the patch. I have two new patches and an
>> updated testcase, which I will send out individually.
>>
>
> Previously, stores to VLAs were marked as alive in DCE because
> is_hidden_global_store triggered. With them converted to stores to a
> normal array, that doesn't happen anymore.
> The stores were then sometimes removed even though there was an
> aliasing load through a function argument. But since ref_may_be_aliased
> was called with a WITH_SIZE_EXPR as the toplevel expr, it returned
> false for the aliasing load. This patch fixes that.
>
> OK for trunk?
Oops. Ok.

Thanks,
Richard.

> Thanks,
> - Tom
>
> 2011-07-30  Tom de Vries  <t...@codesourcery.com>
>
>         PR middle-end/43513
>         * tree-ssa-dce.c (ref_may_be_aliased): Add assert.
>         (propagate_necessity): Handle WITH_SIZE_EXPR call arg.
>
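For readers following along, a hedged sketch of the shape of fix the
ChangeLog describes (the patch itself is not quoted in this message, so
the exact hunks are assumed): propagate_necessity looks through a
WITH_SIZE_EXPR wrapper on a call argument before querying the alias
oracle, and ref_may_be_aliased asserts that it is never handed the
wrapper directly.

  /* In tree-ssa-dce.c (sketch, not the committed hunks).  */

  static bool
  ref_may_be_aliased (tree ref)
  {
    /* Callers must strip WITH_SIZE_EXPR: it is only a size annotation
       around a reference, not a memory reference itself.  */
    gcc_assert (TREE_CODE (ref) != WITH_SIZE_EXPR);
    ...
  }

  /* In propagate_necessity, when scanning the arguments of a call:  */
  tree arg = gimple_call_arg (stmt, i);
  if (TREE_CODE (arg) == WITH_SIZE_EXPR)
    /* Look at the real reference, not the size wrapper, so that an
       aliasing load keeps the store to the (now ordinary) array
       alive.  */
    arg = TREE_OPERAND (arg, 0);

This matches Tom's description above: with the WITH_SIZE_EXPR left as
the toplevel expr, ref_may_be_aliased returned false for the aliasing
load, and DCE could delete a store that was still observable through
the function argument.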