On Thu, Jul 28, 2011 at 7:20 PM, Tom de Vries <vr...@codesourcery.com> wrote: > On 07/28/2011 06:25 PM, Richard Guenther wrote: >> On Thu, 28 Jul 2011, Tom de Vries wrote: >> >>> On 07/28/2011 12:22 PM, Richard Guenther wrote: >>>> On Wed, 27 Jul 2011, Tom de Vries wrote: >>>> >>>>> On 07/27/2011 05:27 PM, Richard Guenther wrote: >>>>>> On Wed, 27 Jul 2011, Tom de Vries wrote: >>>>>> >>>>>>> On 07/27/2011 02:12 PM, Richard Guenther wrote: >>>>>>>> On Wed, 27 Jul 2011, Tom de Vries wrote: >>>>>>>> >>>>>>>>> On 07/27/2011 01:50 PM, Tom de Vries wrote: >>>>>>>>>> Hi Richard, >>>>>>>>>> >>>>>>>>>> I have a patch set for bug 43513 - The stack pointer is adjusted >>>>>>>>>> twice. >>>>>>>>>> >>>>>>>>>> 01_pr43513.3.patch >>>>>>>>>> 02_pr43513.3.test.patch >>>>>>>>>> 03_pr43513.3.mudflap.patch >>>>>>>>>> >>>>>>>>>> The patch set has been bootstrapped and reg-tested on x86_64. >>>>>>>>>> >>>>>>>>>> I will sent out the patches individually. >>>>>>>>>> >>>>>>>>> >>>>>>>>> The patch replaces a vla __builtin_alloca that has a constant >>>>>>>>> argument with an >>>>>>>>> array declaration. >>>>>>>>> >>>>>>>>> OK for trunk? >>>>>>>> >>>>>>>> I don't think it is safe to try to get at the VLA type the way you do. >>>>>>> >>>>>>> I don't understand in what way it's not safe. Do you mean I don't >>>>>>> manage to find >>>>>>> the type always, or that I find the wrong type, or something else? >>>>>> >>>>>> I think you might get the wrong type, >>>>> >>>>> Ok, I'll review that code one more time. >>>>> >>>>>> you also do not transform code >>>>>> like >>>>>> >>>>>> int *p = alloca(4); >>>>>> *p = 3; >>>>>> >>>>>> as there is no array type involved here. >>>>>> >>>>> >>>>> I was trying to stay away from non-vla allocas. A source declared alloca >>>>> has >>>>> function livetime, so we could have a single alloca in a loop, called 10 >>>>> times, >>>>> with all 10 instances live at the same time. This patch does not detect >>>>> such >>>>> cases, and thus stays away from non-vla allocas. A vla decl does not have >>>>> such >>>>> problems, the lifetime ends when it goes out of scope. >>>> >>>> Yes indeed - that probably would require more detailed analysis. >>>> >>>>>>>> In fact I would simply do sth like >>>>>>>> >>>>>>>> elem_type = build_nonstandard_integer_type (BITS_PER_UNIT, 1); >>>>>>>> n_elem = size * 8 / BITS_PER_UNIT; >>>>>>>> array_type = build_array_type_nelts (elem_type, n_elem); >>>>>>>> var = create_tmp_var (array_type, NULL); >>>>>>>> return fold_convert (TREE_TYPE (lhs), build_fold_addr_expr (var)); >>>>>>>> >>>>>>> >>>>>>> I tried this code on the example, and it works, but the newly declared >>>>>>> type has >>>>>>> an 8-bit alignment, while the vla base type has a 32 bit alignment. >>>>>>> This make >>>>>>> the memory access in the example potentially unaligned, which prohibits >>>>>>> an >>>>>>> ivopts optimization, so the resulting text size is 68 instead of the 64 >>>>>>> achieved >>>>>>> with my current patch. >>>>>> >>>>>> Ok, so then set DECL_ALIGN of the variable to something reasonable >>>>>> like MIN (size * 8, GET_MODE_PRECISION (word_mode)). Basically the >>>>>> alignment that the targets alloca function would guarantee. >>>>>> >>>>> >>>>> I tried that, but that doesn't help. It's the alignment of the type that >>>>> matters, not of the decl. >>>> >>>> It shouldn't. All accesses are performed with the original types and >>>> alignment comes from that (plus the underlying decl). >>>> >>> >>> I managed to get it all working by using build_aligned_type rather that >>> DECL_ALIGN. >> >> That's really odd, DECL_ALIGN should just work - nothing refers to the >> type of the decl in the IL. Can you try also setting DECL_USER_ALIGN to >> 1 maybe? >> > > This doesn't work either. > > /* Declare array. */ > elem_type = build_nonstandard_integer_type (BITS_PER_UNIT, 1); > n_elem = size * 8 / BITS_PER_UNIT; > align = MIN (size * 8, GET_MODE_PRECISION (word_mode)); > array_type = build_array_type_nelts (elem_type, n_elem); > var = create_tmp_var (array_type, NULL); > DECL_ALIGN (var) = align; > DECL_USER_ALIGN (var) = 1; > > Maybe this clarifies it: > > Breakpoint 1, may_be_unaligned_p (ref=0xf7d9d410, step=0xf7d3d578) at > /home/vries/local/google/src/gcc-mainline/gcc/tree-ssa-loop-ivopts.c:1621 > (gdb) call debug_generic_expr (ref) > MEM[(int[0:D.2579] *)&D.2595][0] > (gdb) call debug_generic_expr (step) > 4 > > 1627 base = get_inner_reference (ref, &bitsize, &bitpos, &toffset, &mode, > (gdb) call debug_generic_expr (base) > D.2595 > > 1629 base_type = TREE_TYPE (base); > (gdb) call debug_generic_expr (base_type) > <unnamed-unsigned:8>[40] > > 1630 base_align = TYPE_ALIGN (base_type); > (gdb) p base_align > $1 = 8 > > So the align is 8-bits, and we return true here:
Ah, but this code should use get_object_alignment, not solely look at the type. Richard. > (gdb) n > 1632 if (mode != BLKmode) > (gdb) n > 1634 unsigned mode_align = GET_MODE_ALIGNMENT (mode); > (gdb) > 1636 if (base_align < mode_align > (gdb) > 1639 return true; > > > Here we can see that the base actually has the (user) align on it: > > (gdb) call debug_tree (base) > <var_decl 0xf7e1b420 D.2595 > type <array_type 0xf7e1b360 > type <integer_type 0xf7e1b2a0 public unsigned QI > size <integer_cst 0xf7d3d604 constant 8> > unit size <integer_cst 0xf7d3d620 constant 1> > align 8 symtab 0 alias set -1 canonical type 0xf7e1b2a0 precision 8 > min <integer_cst 0xf7dffaf0 0> max <integer_cst 0xf7dffb0c 255> > pointer_to_this <pointer_type 0xf7e1b3c0>> > BLK > size <integer_cst 0xf7d5d070 constant 320> > unit size <integer_cst 0xf7dde2a0 constant 40> > align 8 symtab 0 alias set -1 canonical type 0xf7e1b360 > domain <integer_type 0xf7de12a0 > type <integer_type 0xf7d51000 unsigned int> > SI > size <integer_cst 0xf7d3d78c constant 32> > unit size <integer_cst 0xf7d3d578 constant 4> > align 32 symtab 0 alias set -1 canonical type 0xf7de12a0 > precision 32 min <integer_cst 0xf7d3d594 0> > max <integer_cst 0xf7dde284 39>> > pointer_to_this <pointer_type 0xf7e1b480>> > addressable used ignored BLK file pr43513.c line 4 col 6 > size <integer_cst 0xf7d5d070 320> unit size <integer_cst 0xf7dde2a0 40> > user align 32 context <function_decl 0xf7dfd480 foo3>> > > >>> >>>>> So should we try to find the base type of the vla, and use that, or use >>>>> the >>>>> nonstandard char type? >>>> >>>> I don't think we can reliably find the base type of the vla - well, >>>> in practice we may because we control how we lower VLAs during >>>> gimplification, but nothing in the IL constraints say that the >>>> resulting pointer type should be special. >>>> >>>> Using a char[] decl shouldn't be a problem IMHO. >>>> >>>>>>>> And obviously you lose the optimization we arrange with inserting >>>>>>>> __builtin_stack_save/restore pairs that way - stack space will no >>>>>>>> longer be shared for subsequent VLAs. Which means that you'd >>>>>>>> better limit the size you allow this promotion. >>>>>>>> >>>>>>> >>>>>>> Right, I could introduce a parameter for this. >>>>>> >>>>>> I would think you could use PARAM_LARGE_STACK_FRAME for now and say, >>>>>> allow a size of PARAM_LARGE_STACK_FRAME / 10? >>>>>> >>>>> >>>>> That unfortunately is too small for the example from bug report. The >>>>> default >>>>> value of the param is 250, so that would be a threshold of 25, and the >>>>> alloca >>>>> size of the example is 40. Perhaps we can try a threshold of >>>>> PARAM_LARGE_STACK_FRAME - estimated_stack_size or some such? >>>> >>>> Hm. estimated_stack_size is not O(1), so no. I think we need to >>>> find a sensible way of allowing stack sharing. Eventually Michas >>>> patch for introducing points-of-death would help here, if we'd >>>> go for folding this during stack-save/restore optimization. >>>> >>> >>> I changed the heuristics to this: >>> >>> + /* Heuristic: don't fold large vlas. */ >>> + threshold = (unsigned HOST_WIDE_INT)PARAM_VALUE >>> (PARAM_LARGE_STACK_FRAME); >>> + /* In case a vla is declared at function scope, it has the same lifetime >>> as a >>> + declared array, so we allow a larger size. */ >>> + block = gimple_block (stmt); >>> + if (!(cfun->after_inlining >>> + && TREE_CODE (BLOCK_SUPERCONTEXT (block)) == FUNCTION_DECL)) >>> + threshold /= 10; >>> + if (size > threshold) >>> + return NULL_TREE; >>> >>> The heuristics distinguishes between before and after inlining. >>> >>> After inlining, vla's declared at function scope have the same lifetimes as >>> declared arrays, and don't share their space. There should be no negative >>> effects from folding an alloca in this case, but for safety we set a >>> threshold >>> of PARAM_LARGE_STACK_FRAME. >>> >>> Before inlining, such a vla might be inlined and share its space with >>> another >>> vla, so we stick with the normal threshold before inlining. >> >> That sounds reasonable, though the block check should probably use the >> original VLA decl block, not that of the basic-block of the allocation, >> but unfortunately we don't have access to that. So I suppose using >> the allocation basic-block BLOCK is good enough (still we don't >> really care about BLOCK boundaries when doing CFG manipulations, so >> the allocation bbs block may be not the outermost scope in more cases >> than necessary). >> >>> However, using this heuristic we still don't generate optimal code. >>> >>> During the first pass_ccp, the folding is not done, because the size (40) is >>> larger than the threshold 25. The threshold is 25, because inlining is not >>> yet done. >>> >>> During pass_fold_builtins, the folding is done because it's after inlining, >>> but >>> it's later than pass_iv_optimize, so that still doesn't yield the optimal >>> size >>> of 64. >>> >>> The folding is not done during any of the other invocations or pass_ccp, >>> because >>> the argument has already become constant in the earlier invocation. >> >> Yeah, that's the issue with relying on folding to do this transformation. >> >>> Using this change, I manage to trigger folding during the second invocation >>> of >>> pass_ccp, before iv_optimize so we generate optimal code. >>> >>> Index: gcc/tree-ssa-ccp.c >>> =================================================================== >>> --- gcc/tree-ssa-ccp.c (revision 173734) >>> +++ gcc/tree-ssa-ccp.c (working copy) >>> @@ -1727,6 +1727,13 @@ ccp_fold_stmt (gimple_stmt_iterator *gsi >>> if (gimple_call_internal_p (stmt)) >>> return false; >>> >>> + /* The heuristic of fold_builtin_alloca differs before and after >>> + inlining, so we don't require the arg to be changed into a >>> constant >>> + for folding, but just to be constant. */ >>> + if (gimple_call_alloca_for_var_p (stmt) >>> + && get_constant_value (gimple_call_arg (stmt, 0))) >> >> Probably reverse the get_constant_value check and the transformation > > Done. > >> (gimple_call_alloca_for_var_p isn't a predicate as it has side-effects, >> so its name should be changed). >> >>> + return true; >>> + >>> /* Propagate into the call arguments. Compared to replace_uses_in >>> this can use the argument slot types for type verification >>> instead of the current argument type. We also can safely >>> >>> But, to me it feels like a hack. Do you have any ideas how to do this >>> better? >> >> It's somewhat of a hack, but at least it is more of a defined place >> for this transformation - which then suggests to remove it from >> generic folding and only keep calling it from CCP this way. >> > > Done. > >> Richard. >> >>> Attaching untested patch for reference (will test overnight). >>> >>> Thanks, >>> - Tom >>> >>> 2011-07-28 Tom de Vries <t...@codesourcery.com> >>> >>> PR middle-end/43513 >>> * gimple-fold.c (params.h): Include. >>> (fold_builtin_alloca): New function. >>> (gimple_fold_builtin): Use fold_builtin_alloca. >>> * tree-ssa-ccp.c (ccp_fold_stmt): Force folding of vla-related alloca. >>> * Makefile.in (gimple-fold.o): Add $(PARAMS_H) to rule. > > Thanks, > - Tom >