> On Apr 20, 2022, at 5:38 AM, Richard Biener <richard.guent...@gmail.com> 
> wrote:
> 
> On Tue, Apr 19, 2022 at 11:36 PM Qing Zhao <qing.z...@oracle.com> wrote:
>> 
>> 
>> 
>>> On Apr 14, 2022, at 1:53 AM, Richard Biener <richard.guent...@gmail.com> 
>>> wrote:
>>> 
>>> On Wed, Apr 13, 2022 at 5:22 PM Qing Zhao <qing.z...@oracle.com> wrote:
>>>> 
>>>> Hi, Richard,
>>>> 
>>>> Thanks a lot for taking a look at this issue (and sorry that I haven’t 
>>>> fixed it yet; I was distracted by other tasks and then forgot about it).
>>>> 
>>>>> On Apr 13, 2022, at 3:41 AM, Richard Biener <richard.guent...@gmail.com> 
>>>>> wrote:
>>>>> 
>>>>> On Tue, Feb 15, 2022 at 5:31 PM Qing Zhao via Gcc-patches
>>>>> <gcc-patches@gcc.gnu.org> wrote:
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>>> On Feb 15, 2022, at 3:58 AM, Jakub Jelinek <ja...@redhat.com> wrote:
>>>>>>> 
>>>>>>> Hi!
>>>>>>> 
>>>>>>> For IBM double double I've added verification in PR95450 and PR99648 
>>>>>>> that when we interpret target bytes as a REAL_CST or CONST_DOUBLE 
>>>>>>> constant at the tree/GIMPLE or RTL level, we try to encode it back to 
>>>>>>> target bytes and verify it is the same.
>>>>>>> This is because our real.c support isn't able to represent all valid 
>>>>>>> values of IBM double double, which has variable precision.
>>>>>>> In PR104522, it has been noted that we have a similar problem with the
>>>>>>> Intel/Motorola extended XFmode formats: our internal representation 
>>>>>>> isn't able to record pseudo denormals, pseudo infinities, pseudo NaNs 
>>>>>>> and unnormal values.
>>>>>>> So, the following patch is an attempt to extend that verification to all
>>>>>>> floats.
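
A user-level illustration of the encodings in question (not part of the patch;
it assumes x86, where long double is the Intel 80-bit extended format):

  #include <stdio.h>
  #include <string.h>

  int
  main (void)
  {
    long double one = 1.0L;
    unsigned char b[sizeof one];

    memcpy (b, &one, sizeof b);
    for (int i = 0; i < 10; i++)   /* the 10 value bytes of the encoding */
      printf ("%02x ", b[i]);
    printf ("\n");                 /* expect: 00 00 00 00 00 00 00 80 ff 3f */
    /* The top bit of byte 7 is the explicit integer bit; bytes 8-9 hold the
       biased exponent (0x3fff here) plus the sign.  Clearing the integer bit
       while keeping a non-zero exponent (e.g. byte 7 = 0x40) yields an
       "unnormal" encoding: the value it denotes is representable, but the
       always-normalized internal representation would re-encode it with the
       integer bit set, so the original bytes cannot be reproduced.  */
    return 0;
  }
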
>>>>>>> Unfortunately, it wasn't that straightforward, because the
>>>>>>> __builtin_clear_padding code, specifically for XFmode long doubles, 
>>>>>>> needs to discover which bits are padding and does that by interpreting 
>>>>>>> memory of all ones.  That is actually a valid supported value, a qNaN 
>>>>>>> with negative sign and all mantissa bits set, but the verification also 
>>>>>>> includes the padding bits (exactly what __builtin_clear_padding wants 
>>>>>>> to figure out), so it fails the comparison check and we ICE.
>>>>>>> The patch fixes that case by moving that verification from
>>>>>>> native_interpret_real to its caller, so that clear_padding_type can
>>>>>>> call native_interpret_real and avoid that extra check.
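
And a rough user-level analogy of why the all-ones probe trips a byte-for-byte
check (again an x86-64 assumption: sizeof (long double) is 16, but only 10
bytes carry the value):

  #include <stdio.h>
  #include <string.h>

  int
  main (void)
  {
    unsigned char ones[16], back[16];
    long double x;

    memset (ones, 0xff, sizeof ones);
    memcpy (&x, ones, sizeof x);   /* "interpret" the bytes as long double */
    memset (back, 0, sizeof back);
    memcpy (back, &x, 10);         /* "re-encode": only the 10 value bytes */

    printf ("value bytes round-trip: %d\n", memcmp (ones, back, 10) == 0); /* 1 */
    printf ("whole object matches:   %d\n", memcmp (ones, back, 16) == 0); /* 0 */
    return 0;
  }

The value itself survives the round trip; the comparison only fails because
the padding bytes are included, which mirrors why the patch moves the
verification from native_interpret_real to its caller so that
clear_padding_type can skip it.
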
>>>>>>> 
>>>>>>> With this, the only thing that regresses in the testsuite is
>>>>>>> +FAIL: gcc.target/i386/auto-init-4.c scan-assembler-times long\\t-16843010 5
>>>>>>> because it decides to use a pattern that has non-zero bits in the 
>>>>>>> padding
>>>>>>> bits of the long double, so the simplify-rtx.cc change prevents folding
>>>>>>> a SUBREG into a constant.  We emit (the testcase is -O0 but we emit 
>>>>>>> worse
>>>>>>> code at all opt levels) something like:
>>>>>>>     movabsq $-72340172838076674, %rax
>>>>>>>     movabsq $-72340172838076674, %rdx
>>>>>>>     movq    %rax, -48(%rbp)
>>>>>>>     movq    %rdx, -40(%rbp)
>>>>>>>     fldt    -48(%rbp)
>>>>>>>     fstpt   -32(%rbp)
>>>>>>> instead of
>>>>>>>     fldt    .LC2(%rip)
>>>>>>>     fstpt   -32(%rbp)
>>>>>>> ...
>>>>>>> .LC2:
>>>>>>>     .long   -16843010
>>>>>>>     .long   -16843010
>>>>>>>     .long   65278
>>>>>>>     .long   0
>>>>>>> Note that neither of those sequences actually stores the padding bits; 
>>>>>>> fstpt simply doesn't touch them.
>>>>>>> For vars with clear_padding_real_needs_padding_p types that are 
>>>>>>> allocated to memory at expansion time, I'd say it would be much better 
>>>>>>> to do the stores using integral modes rather than XFmode, so do that:
>>>>>>>     movabsq $-72340172838076674, %rax
>>>>>>>     movq    %rax, -32(%rbp)
>>>>>>>     movq    %rax, -24(%rbp)
>>>>>>> directly.  That is the only way to ensure the padding bits are 
>>>>>>> initialized
>>>>>>> (or expand __builtin_clear_padding, but then you initialize separately 
>>>>>>> the
>>>>>>> value bits and padding bits).
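
At the C level, the two alternatives read roughly like this (a sketch only,
using the GCC builtins __builtin_nanl and __builtin_clear_padding, and
assuming x86 long double with trailing padding):

  #include <string.h>

  /* Block-store the whole slot, padding included: the effect of the
     movabsq/movq sequence above.  */
  void
  via_integral_stores (long double *p)
  {
    memset (p, 0xfe, sizeof *p);
  }

  /* Or write the value bits normally (an XFmode store leaves the padding
     untouched) and then clear just the padding bits afterwards.  */
  void
  via_clear_padding (long double *p)
  {
    *p = -__builtin_nanl ("");
    __builtin_clear_padding (p);
  }
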
>>>>>>> 
>>>>>>> Bootstrapped/regtested on x86_64-linux and i686-linux, though as 
>>>>>>> mentioned
>>>>>>> above, the gcc.target/i386/auto-init-4.c case is unresolved.
>>>>>> 
>>>>>> Thanks, I will try to fix this test case in a later patch.
>>>>> 
>>>>> I've looked at this FAIL now and really wonder whether "pattern init" as
>>>>> implemented makes any sense for non-integral types.
>>>>> We end up initializing a register (SSA name) with
>>>>> 
>>>>> VIEW_CONVERT_EXPR<long double>(0xfefefefefefefefefefefefefefefefe)
>>>>> 
>>>>> as we go building a TImode constant (we verified we have a TImode SET!)
>>>>> but then
>>>>> 
>>>>>        /* Pun the LHS to make sure its type has constant size
>>>>>           unless it is an SSA name where that's already known.  */
>>>>>        if (TREE_CODE (lhs) != SSA_NAME)
>>>>>          lhs = build1 (VIEW_CONVERT_EXPR, itype, lhs);
>>>>>        else
>>>>>          init = fold_build1 (VIEW_CONVERT_EXPR, TREE_TYPE (lhs), init);
>>>>> ...
>>>>>    expand_assignment (lhs, init, false);
>>>>> 
>>>>> and generally registers do not have any padding.  This weird expansion
>>>>> then causes us to spill the TImode constant and reload the XFmode value,
>>>>> which is definitely not optimal here.
>>>>> 
>>>>> One approach to avoid the worse code generation would be to use mode
>>>>> specific patterns for registers (like using a NaN or a target specific
>>>>> value that
>>>>> can be loaded cheaply),
>>>> 
>>>> You mean using “mode specific patterns” ONLY for registers?
>>>> Can we use “mode specific patterns” consistently for both registers and 
>>>> memory?
>>> 
>>> The difference is that registers don't have padding while memory possibly 
>>> does; also, for aggregates, using different patterns is too expensive IMHO.
>>> For registers the extra complication with generic patterns is that 
>>> efficient initialization (without going through memory) should be a 
>>> priority IMHO.
>>> 
>>> And for stack memory I still think that initializing the full
>>> allocated frame (including padding
>>> between variables!) is the best approach.
>>> 
>>>> LLVM uses different patterns for different types (integral, float, 
>>>> pointer, etc.) in order to catch type-specific bugs more easily.
>>>> 
>>>> The early versions of this patch tried to do something similar to LLVM, 
>>>> but later we gave that up and used the same pattern for all types.
>>>> 
>>>> If we now want to use “mode specific patterns” for registers, I am 
>>>> wondering whether it would be better to just use “mode specific patterns” 
>>>> consistently for both registers and memory?
>>>> 
>>>>> another would be to simply fall back to zero
>>>>> initialization
>>>>> when we fail to constant fold the initializer like with
>>>>> 
>>>>> diff --git a/gcc/internal-fn.cc b/gcc/internal-fn.cc
>>>>> index 8b1733e20c4..a4b995f71e4 100644
>>>>> --- a/gcc/internal-fn.cc
>>>>> +++ b/gcc/internal-fn.cc
>>>>> @@ -3125,7 +3125,11 @@ expand_DEFERRED_INIT (internal_fn, gcall *stmt)
>>>>>        if (TREE_CODE (lhs) != SSA_NAME)
>>>>>          lhs = build1 (VIEW_CONVERT_EXPR, itype, lhs);
>>>>>        else
>>>>> -           init = fold_build1 (VIEW_CONVERT_EXPR, TREE_TYPE (lhs), init);
>>>>> +           {
>>>>> +             init = fold_build1 (VIEW_CONVERT_EXPR, TREE_TYPE (lhs), init);
>>>>> +             if (!CONSTANT_CLASS_P (init))
>>>>> +               init = build_zero_cst (TREE_TYPE (lhs));
>>>>> +           }
>>>>>      }
>>>>>     else
>>>>>      /* Use zero-init also for variable-length sizes.  */
>>>>> 
>>>>> note that still does not address the issue noted by Jakub that we do not
>>>>> initialize padding this way - but of course that's because we expand a
>>>>> special assignment from .DEFERRED_INIT and are not initializing
>>>>> the storage allocated for the variable on the stack (at -O0) by RTL
>>>>> expansion.  Ideally .DEFERRED_INIT "expansion" for stack variables
>>>>> would take place there, simply filling the allocated frame with the
>>>>> pattern or zero.  That would probably mean that RTL expansion
>>>>> should scoop up .DEFERRED_INITs for variables it decides not to
>>>>> expand to pseudos and not leave that to stmt expansion.
>>>> 
>>>> Need to study this a little bit more...
>> 
>> 
>> So, is what you mean above a complete rewrite of the current 
>> “expand_DEFERRED_INIT”:
>> 
>> Instead of simply using “expand_builtin_memset” for variables that are not 
>> in registers and “expand_assignment” for variables that are in registers, 
>> RTL expansion should directly expand this call at a lower level:
>> 
>> i.e.,
>> 
>>  tree lhs = gimple_call_lhs (stmt);
>>  …
>> 
>>  rtx target = expand_expr (lhs, NULL_RTX, VOIDmode, EXPAND_WRITE);
>> 
>>  if (MEM_P (target))      // the variable is allocated on the stack
>>    {
>>      // fill the allocated frame with the pattern or zero.  How to do it??
>>    }
>>  else                      // the variable is in a pseudo register
>>    {
>>      rtx init_rtx = …;
>>      emit_move_insn (target, init_rtx);
>>    }
>> 
>> 
>> Is the above a correct understanding?
> 
> No, I would simply expand the automatic in-memory var .DEFERRED_INIT calls 
> to nothing, but instead process their effect at the time we do 
> expand_one_stack_var, which means: at CFG expansion time, scan for the 
> .DEFERRED_INITs and record whether and how to initialize the variables.  We 
> can then group the variables accordingly and block-initialize the stack 
> space, also covering the padding in between variables.

I spent some time reading cfgexpand, especially the “expand_used_vars” part, 
and I have the following questions (plus a rough sketch of how I currently 
picture the block initialization, after question 2):

1. Does “group the variables accordingly” mean that, for all the auto 
variables associated with .DEFERRED_INIT, during the “expand_used_vars” phase 
we should first defer their stack allocation and mark them as 
“DEFERRED_INIT”, and later, during the stack grouping phase, in addition to 
the current phase 1, phase 2 and phase 3 “expand_stack_vars” calls, add 
another phase in “expand_stack_vars” for the ones marked “DEFERRED_INIT”, 
group them together, and then block-initialize that part of the stack space?

2. It looks like “expand_used_vars” is done BEFORE 
“expand_gimple_basic_block”.  So, if we insert the pattern-initialization 
RTL that block-initializes the stack space, how can we do that in the 
“expand_used_vars” phase, before all the RTL statements are emitted?
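
To check that I read the “group and block-initialize” part correctly, here is 
the rough C-level model I have in mind (purely illustrative; the real work 
would of course happen in cfgexpand when the frame is laid out, not in user 
code):

  #include <string.h>

  static void
  consume (char *a, long double *b, int *c)
  {
    /* Keep the variables address-taken so they live in the stack frame.  */
    (void) a; (void) b; (void) c;
  }

  void
  per_variable_init (void)
  {
    char a[3];
    long double b;
    int c;
    /* One fill per variable: the alignment padding between the slots of
       a, b and c stays uninitialized.  */
    memset (a, 0xfe, sizeof a);
    memset (&b, 0xfe, sizeof b);
    memset (&c, 0xfe, sizeof c);
    consume (a, &b, &c);
  }

  void
  whole_block_init (void)
  {
    /* Group the deferred-init variables and fill them as one block; the
       padding between the members is covered by the same fill.  */
    struct { char a[3]; long double b; int c; } frame;
    memset (&frame, 0xfe, sizeof frame);
    consume (frame.a, &frame.b, &frame.c);
  }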

Qing

> 
> Richard.
> 
>> Thanks.
>> 
>> Qing
>> 
>>>> 
>>>>> 
>>>>> I'm going to simply XFAIL this testcase for GCC 12 and have opened
>>>>> PR105259 so we can revisit the above for GCC 13.
>>>> 
>>>> I can study this bug further and try to come up with a good solution in 
>>>> GCC 13.
>>>> 
>>>> Thank you!
>>>> 
>>>> Qing
>>>>> 
>>>>> Richard.
>>>>> 
>>>>>> Qing
