Hi,

On 07/28/2011 08:20 PM, Tom de Vries wrote:
> On 07/28/2011 06:25 PM, Richard Guenther wrote:
>> On Thu, 28 Jul 2011, Tom de Vries wrote:
>>
>>> On 07/28/2011 12:22 PM, Richard Guenther wrote:
>>>> On Wed, 27 Jul 2011, Tom de Vries wrote:
>>>>
>>>>> On 07/27/2011 05:27 PM, Richard Guenther wrote:
>>>>>> On Wed, 27 Jul 2011, Tom de Vries wrote:
>>>>>>
>>>>>>> On 07/27/2011 02:12 PM, Richard Guenther wrote:
>>>>>>>> On Wed, 27 Jul 2011, Tom de Vries wrote:
>>>>>>>>
>>>>>>>>> On 07/27/2011 01:50 PM, Tom de Vries wrote:
>>>>>>>>>> Hi Richard,
>>>>>>>>>>
>>>>>>>>>> I have a patch set for bug 43513 - The stack pointer is adjusted
>>>>>>>>>> twice.
>>>>>>>>>>
>>>>>>>>>> 01_pr43513.3.patch
>>>>>>>>>> 02_pr43513.3.test.patch
>>>>>>>>>> 03_pr43513.3.mudflap.patch
>>>>>>>>>>
>>>>>>>>>> The patch set has been bootstrapped and reg-tested on x86_64.
>>>>>>>>>>
>>>>>>>>>> I will send out the patches individually.
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> The patch replaces a vla __builtin_alloca that has a constant
>>>>>>>>> argument with an array declaration.
>>>>>>>>>
>>>>>>>>> OK for trunk?
>>>>>>>>
>>>>>>>> I don't think it is safe to try to get at the VLA type the way you do.
>>>>>>>
>>>>>>> I don't understand in what way it's not safe.  Do you mean I don't
>>>>>>> manage to find the type always, or that I find the wrong type, or
>>>>>>> something else?
>>>>>>
>>>>>> I think you might get the wrong type,
>>>>>
>>>>> Ok, I'll review that code one more time.
>>>>>
>>>>>> you also do not transform code like
>>>>>>
>>>>>>   int *p = alloca(4);
>>>>>>   *p = 3;
>>>>>>
>>>>>> as there is no array type involved here.
>>>>>>
>>>>>
>>>>> I was trying to stay away from non-vla allocas.  A source-declared
>>>>> alloca has function lifetime, so we could have a single alloca in a
>>>>> loop, called 10 times, with all 10 instances live at the same time.
>>>>> This patch does not detect such cases, and thus stays away from
>>>>> non-vla allocas.  A vla decl does not have such problems; its lifetime
>>>>> ends when it goes out of scope.
>>>>
>>>> Yes indeed - that probably would require more detailed analysis.
>>>>
>>>>>>>> In fact I would simply do sth like
>>>>>>>>
>>>>>>>>   elem_type = build_nonstandard_integer_type (BITS_PER_UNIT, 1);
>>>>>>>>   n_elem = size * 8 / BITS_PER_UNIT;
>>>>>>>>   array_type = build_array_type_nelts (elem_type, n_elem);
>>>>>>>>   var = create_tmp_var (array_type, NULL);
>>>>>>>>   return fold_convert (TREE_TYPE (lhs), build_fold_addr_expr (var));
>>>>>>>>
>>>>>>>
>>>>>>> I tried this code on the example, and it works, but the newly declared
>>>>>>> type has an 8-bit alignment, while the vla base type has a 32-bit
>>>>>>> alignment.  This makes the memory access in the example potentially
>>>>>>> unaligned, which prohibits an ivopts optimization, so the resulting
>>>>>>> text size is 68 instead of the 64 achieved with my current patch.
>>>>>>
>>>>>> Ok, so then set DECL_ALIGN of the variable to something reasonable
>>>>>> like MIN (size * 8, GET_MODE_PRECISION (word_mode)).  Basically the
>>>>>> alignment that the target's alloca function would guarantee.
>>>>>>
>>>>>
>>>>> I tried that, but that doesn't help.  It's the alignment of the type
>>>>> that matters, not of the decl.
>>>>
>>>> It shouldn't.  All accesses are performed with the original types and
>>>> alignment comes from that (plus the underlying decl).
>>>>
>>>
>>> I managed to get it all working by using build_aligned_type rather than
>>> DECL_ALIGN.
>>
>> That's really odd, DECL_ALIGN should just work - nothing refers to the
>> type of the decl in the IL.  Can you try also setting DECL_USER_ALIGN to
>> 1 maybe?
>>
>
> This doesn't work either.
>
>   /* Declare array.  */
>   elem_type = build_nonstandard_integer_type (BITS_PER_UNIT, 1);
>   n_elem = size * 8 / BITS_PER_UNIT;
>   align = MIN (size * 8, GET_MODE_PRECISION (word_mode));
>   array_type = build_array_type_nelts (elem_type, n_elem);
>   var = create_tmp_var (array_type, NULL);
>   DECL_ALIGN (var) = align;
>   DECL_USER_ALIGN (var) = 1;
>
> Maybe this clarifies it:
>
> Breakpoint 1, may_be_unaligned_p (ref=0xf7d9d410, step=0xf7d3d578) at
>     /home/vries/local/google/src/gcc-mainline/gcc/tree-ssa-loop-ivopts.c:1621
> (gdb) call debug_generic_expr (ref)
> MEM[(int[0:D.2579] *)&D.2595][0]
> (gdb) call debug_generic_expr (step)
> 4
>
> 1627      base = get_inner_reference (ref, &bitsize, &bitpos, &toffset, &mode,
> (gdb) call debug_generic_expr (base)
> D.2595
>
> 1629      base_type = TREE_TYPE (base);
> (gdb) call debug_generic_expr (base_type)
> <unnamed-unsigned:8>[40]
>
> 1630      base_align = TYPE_ALIGN (base_type);
> (gdb) p base_align
> $1 = 8
>
> So the align is 8 bits, and we return true here:
>
> (gdb) n
> 1632      if (mode != BLKmode)
> (gdb) n
> 1634          unsigned mode_align = GET_MODE_ALIGNMENT (mode);
> (gdb)
> 1636          if (base_align < mode_align
> (gdb)
> 1639        return true;
>
> Here we can see that the base actually has the (user) align on it:
>
> (gdb) call debug_tree (base)
>  <var_decl 0xf7e1b420 D.2595
>     type <array_type 0xf7e1b360
>         type <integer_type 0xf7e1b2a0 public unsigned QI
>             size <integer_cst 0xf7d3d604 constant 8>
>             unit size <integer_cst 0xf7d3d620 constant 1>
>             align 8 symtab 0 alias set -1 canonical type 0xf7e1b2a0
>             precision 8
>             min <integer_cst 0xf7dffaf0 0> max <integer_cst 0xf7dffb0c 255>
>             pointer_to_this <pointer_type 0xf7e1b3c0>>
>         BLK
>         size <integer_cst 0xf7d5d070 constant 320>
>         unit size <integer_cst 0xf7dde2a0 constant 40>
>         align 8 symtab 0 alias set -1 canonical type 0xf7e1b360
>         domain <integer_type 0xf7de12a0
>             type <integer_type 0xf7d51000 unsigned int>
>             SI
>             size <integer_cst 0xf7d3d78c constant 32>
>             unit size <integer_cst 0xf7d3d578 constant 4>
>             align 32 symtab 0 alias set -1 canonical type 0xf7de12a0
>             precision 32 min <integer_cst 0xf7d3d594 0>
>             max <integer_cst 0xf7dde284 39>>
>         pointer_to_this <pointer_type 0xf7e1b480>>
>     addressable used ignored BLK file pr43513.c line 4 col 6
>     size <integer_cst 0xf7d5d070 320> unit size <integer_cst 0xf7dde2a0 40>
>     user align 32 context <function_decl 0xf7dfd480 foo3>>
>
>>>
>>>>> So should we try to find the base type of the vla, and use that, or
>>>>> use the nonstandard char type?
>>>>
>>>> I don't think we can reliably find the base type of the vla - well,
>>>> in practice we may because we control how we lower VLAs during
>>>> gimplification, but nothing in the IL constraints say that the
>>>> resulting pointer type should be special.
>>>>
>>>> Using a char[] decl shouldn't be a problem IMHO.
>>>>
>>>>>>>> And obviously you lose the optimization we arrange with inserting
>>>>>>>> __builtin_stack_save/restore pairs that way - stack space will no
>>>>>>>> longer be shared for subsequent VLAs.  Which means that you'd
>>>>>>>> better limit the size you allow this promotion.
>>>>>>>>
>>>>>>>
>>>>>>> Right, I could introduce a parameter for this.
>>>>>>
>>>>>> I would think you could use PARAM_LARGE_STACK_FRAME for now and say,
>>>>>> allow a size of PARAM_LARGE_STACK_FRAME / 10?
>>>>>>
>>>>>
>>>>> That unfortunately is too small for the example from the bug report.
>>>>> The default value of the param is 250, so that would be a threshold
>>>>> of 25, and the alloca size of the example is 40.  Perhaps we can try
>>>>> a threshold of PARAM_LARGE_STACK_FRAME - estimated_stack_size or some
>>>>> such?
>>>>
>>>> Hm.  estimated_stack_size is not O(1), so no.  I think we need to
>>>> find a sensible way of allowing stack sharing.  Eventually Micha's
>>>> patch for introducing points-of-death would help here, if we'd
>>>> go for folding this during stack-save/restore optimization.
>>>>
>>>
>>> I changed the heuristics to this:
>>>
>>> +  /* Heuristic: don't fold large vlas.  */
>>> +  threshold = (unsigned HOST_WIDE_INT)PARAM_VALUE (PARAM_LARGE_STACK_FRAME);
>>> +  /* In case a vla is declared at function scope, it has the same lifetime
>>> +     as a declared array, so we allow a larger size.  */
>>> +  block = gimple_block (stmt);
>>> +  if (!(cfun->after_inlining
>>> +        && TREE_CODE (BLOCK_SUPERCONTEXT (block)) == FUNCTION_DECL))
>>> +    threshold /= 10;
>>> +  if (size > threshold)
>>> +    return NULL_TREE;
>>>
>>> The heuristic distinguishes between before and after inlining.
>>>
>>> After inlining, vlas declared at function scope have the same lifetimes
>>> as declared arrays, and don't share their space.  There should be no
>>> negative effects from folding an alloca in this case, but for safety we
>>> set a threshold of PARAM_LARGE_STACK_FRAME.
>>>
>>> Before inlining, such a vla might be inlined and share its space with
>>> another vla, so we stick with the normal threshold before inlining.
>>
>> That sounds reasonable, though the block check should probably use the
>> original VLA decl block, not that of the basic-block of the allocation,
>> but unfortunately we don't have access to that.  So I suppose using
>> the allocation basic-block BLOCK is good enough (still, we don't
>> really care about BLOCK boundaries when doing CFG manipulations, so
>> the allocation bb's block may not be the outermost scope in more cases
>> than necessary).
>>
>>> However, using this heuristic we still don't generate optimal code.
>>>
>>> During the first pass_ccp, the folding is not done, because the size
>>> (40) is larger than the threshold of 25.  The threshold is 25 because
>>> inlining is not yet done.
>>>
>>> During pass_fold_builtins, the folding is done because it's after
>>> inlining, but it's later than pass_iv_optimize, so that still doesn't
>>> yield the optimal size of 64.
>>>
>>> The folding is not done during any of the other invocations of
>>> pass_ccp, because the argument has already become constant in the
>>> earlier invocation.
>>
>> Yeah, that's the issue with relying on folding to do this transformation.
>>
>>> Using this change, I manage to trigger folding during the second
>>> invocation of pass_ccp, before iv_optimize, so we generate optimal code.
>>>
>>> Index: gcc/tree-ssa-ccp.c
>>> ===================================================================
>>> --- gcc/tree-ssa-ccp.c  (revision 173734)
>>> +++ gcc/tree-ssa-ccp.c  (working copy)
>>> @@ -1727,6 +1727,13 @@ ccp_fold_stmt (gimple_stmt_iterator *gsi
>>>        if (gimple_call_internal_p (stmt))
>>>          return false;
>>>
>>> +      /* The heuristic of fold_builtin_alloca differs before and after
>>> +         inlining, so we don't require the arg to be changed into a
>>> +         constant for folding, but just to be constant.  */
>>> +      if (gimple_call_alloca_for_var_p (stmt)
>>> +          && get_constant_value (gimple_call_arg (stmt, 0)))
>>
>> Probably reverse the get_constant_value check and the transformation
>
> Done.
>
>> (gimple_call_alloca_for_var_p isn't a predicate as it has side-effects,
>> so its name should be changed).
>>
>>> +        return true;
>>> +
>>>        /* Propagate into the call arguments.  Compared to replace_uses_in
>>>           this can use the argument slot types for type verification
>>>           instead of the current argument type.  We also can safely
>>>
>>> But, to me it feels like a hack.  Do you have any ideas how to do this
>>> better?
>>
>> It's somewhat of a hack, but at least it is more of a defined place
>> for this transformation - which then suggests to remove it from
>> generic folding and only keep calling it from CCP this way.
>>
>
> Done.
>
This is an updated version of the patch.

I have 2 new patches and an updated testcase, which I will send out
individually.

The patch set was bootstrapped and reg-tested on x86_64.

OK for trunk?

Thanks,
- Tom

2011-07-30  Tom de Vries  <t...@codesourcery.com>

	PR middle-end/43513
	* Makefile.in (tree-ssa-ccp.o): Add $(PARAMS_H) to rule.
	* tree-ssa-ccp.c (params.h): Include.
	(fold_builtin_alloca_for_var): New function.
	(ccp_fold_stmt): Use fold_builtin_alloca_for_var.
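For reference, a minimal example of the kind of code the fold targets - a
sketch reconstructed from the debug session quoted above, not necessarily
the exact testcase from the PR (bar is a hypothetical external consumer):

  extern void bar (int *);

  void
  foo3 (void)
  {
    const int n = 10;
    int i, a[n];        /* Formally a vla, so it is lowered to an
                           alloca-for-var call; n propagates, making the
                           argument the constant 40.  */
    for (i = 0; i < n; i++)
      a[i] = i;         /* ivopts relies on 32-bit alignment of 'a'.  */
    bar (a);
  }

With the default PARAM_LARGE_STACK_FRAME of 250, the 40-byte allocation
exceeds the pre-inlining threshold of 25 but stays under the function-scope
threshold of 250, so it is folded during the second invocation of pass_ccp,
after inlining and before iv_optimize.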
Index: gcc/tree-ssa-ccp.c
===================================================================
--- gcc/tree-ssa-ccp.c  (revision 173734)
+++ gcc/tree-ssa-ccp.c  (working copy)
@@ -133,6 +133,7 @@ along with GCC; see the file COPYING3.
 #include "diagnostic-core.h"
 #include "dbgcnt.h"
 #include "gimple-fold.h"
+#include "params.h"


 /* Possible lattice values.  */
@@ -1659,6 +1660,51 @@ evaluate_stmt (gimple stmt)
   return val;
 }

+/* Detect a vla-related alloca with a constant argument.  Declare a fixed-size
+   array and return its address if found; otherwise return NULL_TREE.  */
+
+static tree
+fold_builtin_alloca_for_var (gimple stmt)
+{
+  unsigned HOST_WIDE_INT size, threshold, n_elem;
+  tree lhs, arg, block, var, elem_type, array_type;
+  unsigned int align;
+
+  /* Get lhs.  */
+  lhs = gimple_call_lhs (stmt);
+  if (lhs == NULL_TREE)
+    return NULL_TREE;
+
+  /* Detect constant argument.  */
+  arg = get_constant_value (gimple_call_arg (stmt, 0));
+  if (arg == NULL_TREE || TREE_CODE (arg) != INTEGER_CST
+      || !host_integerp (arg, 1))
+    return NULL_TREE;
+  size = TREE_INT_CST_LOW (arg);
+
+  /* Heuristic: don't fold large vlas.  */
+  threshold = (unsigned HOST_WIDE_INT)PARAM_VALUE (PARAM_LARGE_STACK_FRAME);
+  /* In case a vla is declared at function scope, it has the same lifetime
+     as a declared array, so we allow a larger size.  */
+  block = gimple_block (stmt);
+  if (!(cfun->after_inlining
+        && TREE_CODE (BLOCK_SUPERCONTEXT (block)) == FUNCTION_DECL))
+    threshold /= 10;
+  if (size > threshold)
+    return NULL_TREE;
+
+  /* Declare array.  */
+  elem_type = build_nonstandard_integer_type (BITS_PER_UNIT, 1);
+  n_elem = size * 8 / BITS_PER_UNIT;
+  align = MIN (size * 8, GET_MODE_PRECISION (word_mode));
+  array_type = build_array_type_nelts (elem_type, n_elem);
+  var = create_tmp_var (array_type, NULL);
+  DECL_ALIGN (var) = align;
+
+  /* Fold alloca to the address of the array.  */
+  return fold_convert (TREE_TYPE (lhs), build_fold_addr_expr (var));
+}
+
 /* Fold the stmt at *GSI with CCP specific information that propagating
    and regular folding does not catch.  */

@@ -1727,6 +1773,20 @@ ccp_fold_stmt (gimple_stmt_iterator *gsi
       if (gimple_call_internal_p (stmt))
         return false;

+      /* The heuristic of fold_builtin_alloca_for_var differs before and
+         after inlining, so we don't require the arg to be changed into a
+         constant for folding, but just to be constant.  */
+      if (gimple_call_alloca_for_var_p (stmt))
+        {
+          tree new_rhs = fold_builtin_alloca_for_var (stmt);
+          bool res;
+          if (new_rhs == NULL_TREE)
+            return false;
+          res = update_call_from_tree (gsi, new_rhs);
+          gcc_assert (res);
+          return true;
+        }
+
       /* Propagate into the call arguments.  Compared to replace_uses_in
          this can use the argument slot types for type verification
          instead of the current argument type.  We also can safely
Index: gcc/Makefile.in
===================================================================
--- gcc/Makefile.in  (revision 173734)
+++ gcc/Makefile.in  (working copy)
@@ -3157,7 +3157,7 @@ tree-call-cdce.o : tree-call-cdce.c $(CO
 tree-ssa-ccp.o : tree-ssa-ccp.c $(TREE_FLOW_H) $(CONFIG_H) \
    $(SYSTEM_H) $(TREE_H) $(TM_P_H) $(EXPR_H) output.h \
    $(DIAGNOSTIC_H) $(FUNCTION_H) $(TIMEVAR_H) $(TM_H) coretypes.h \
-   $(TREE_DUMP_H) $(BASIC_BLOCK_H) $(TREE_PASS_H) langhooks.h \
+   $(TREE_DUMP_H) $(BASIC_BLOCK_H) $(TREE_PASS_H) langhooks.h $(PARAMS_H) \
    tree-ssa-propagate.h value-prof.h $(FLAGS_H) $(TARGET_H) $(DIAGNOSTIC_CORE_H) \
    $(DBGCNT_H) tree-pretty-print.h gimple-pretty-print.h gimple-fold.h
 tree-sra.o : tree-sra.c $(CONFIG_H) $(SYSTEM_H) coretypes.h alloc-pool.h \
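(As an aside, a sketch of the non-vla case the fold deliberately leaves
alone, per the discussion above; 'use' is a hypothetical consumer.  A
source-level alloca has function lifetime, so every allocation made in the
loop below stays live until the function returns, and no single fixed-size
array could replace the call:

  extern void use (char **);

  void
  f (void)
  {
    char *p[10];
    int i;
    for (i = 0; i < 10; i++)
      p[i] = __builtin_alloca (4);  /* All 10 blocks live at the same time.  */
    use (p);
  }

A vla's storage, by contrast, is released when its block is left, which is
why ccp_fold_stmt only considers alloca-for-var calls.)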