On Fri, Apr 22, 2022 at 5:26 PM Qing Zhao <qing.z...@oracle.com> wrote: > > > > > On Apr 21, 2022, at 2:09 AM, Richard Biener <richard.guent...@gmail.com> > > wrote: > > > > On Wed, Apr 20, 2022 at 6:02 PM Qing Zhao <qing.z...@oracle.com> wrote: > >> > >> > >> > >>> On Apr 20, 2022, at 5:38 AM, Richard Biener <richard.guent...@gmail.com> > >>> wrote: > >>> > >>> On Tue, Apr 19, 2022 at 11:36 PM Qing Zhao <qing.z...@oracle.com> wrote: > >>>> > >>>> > >>>> > >>>>> On Apr 14, 2022, at 1:53 AM, Richard Biener > >>>>> <richard.guent...@gmail.com> wrote: > >>>>> > >>>>> On Wed, Apr 13, 2022 at 5:22 PM Qing Zhao <qing.z...@oracle.com> wrote: > >>>>>> > >>>>>> Hi, Richard, > >>>>>> > >>>>>> Thanks a lot for taking a look at this issue (and Sorry that I haven’t > >>>>>> fixed this one yet, I was distracted by other tasks then just forgot > >>>>>> this one….) > >>>>>> > >>>>>>> On Apr 13, 2022, at 3:41 AM, Richard Biener > >>>>>>> <richard.guent...@gmail.com> wrote: > >>>>>>> > >>>>>>> On Tue, Feb 15, 2022 at 5:31 PM Qing Zhao via Gcc-patches > >>>>>>> <gcc-patches@gcc.gnu.org> wrote: > >>>>>>>> > >>>>>>>> > >>>>>>>> > >>>>>>>>> On Feb 15, 2022, at 3:58 AM, Jakub Jelinek <ja...@redhat.com> wrote: > >>>>>>>>> > >>>>>>>>> Hi! > >>>>>>>>> > >>>>>>>>> For IBM double double I've added in PR95450 and PR99648 > >>>>>>>>> verification that > >>>>>>>>> when we at the tree/GIMPLE or RTL level interpret target bytes as a > >>>>>>>>> REAL_CST > >>>>>>>>> or CONST_DOUBLE constant, we try to encode it back to target bytes > >>>>>>>>> and > >>>>>>>>> verify it is the same. > >>>>>>>>> This is because our real.c support isn't able to represent all > >>>>>>>>> valid values > >>>>>>>>> of IBM double double which has variable precision. > >>>>>>>>> In PR104522, it has been noted that we have similar problem with the > >>>>>>>>> Intel/Motorola extended XFmode formats, our internal representation > >>>>>>>>> isn't > >>>>>>>>> able to record pseudo denormals, pseudo infinities, pseudo NaNs and > >>>>>>>>> unnormal > >>>>>>>>> values. > >>>>>>>>> So, the following patch is an attempt to extend that verification > >>>>>>>>> to all > >>>>>>>>> floats. > >>>>>>>>> Unfortunately, it wasn't that straightforward, because the > >>>>>>>>> __builtin_clear_padding code exactly for the XFmode long doubles > >>>>>>>>> needs to > >>>>>>>>> discover what bits are padding and does that by interpreting memory > >>>>>>>>> of > >>>>>>>>> all 1s. That is actually a valid supported value, a qNaN with > >>>>>>>>> negative > >>>>>>>>> sign with all mantissa bits set, but the verification includes also > >>>>>>>>> the > >>>>>>>>> padding bits (exactly what __builtin_clear_padding wants to figure > >>>>>>>>> out) > >>>>>>>>> and so fails the comparison check and so we ICE. > >>>>>>>>> The patch fixes that case by moving that verification from > >>>>>>>>> native_interpret_real to its caller, so that clear_padding_type can > >>>>>>>>> call native_interpret_real and avoid that extra check. > >>>>>>>>> > >>>>>>>>> With this, the only thing that regresses in the testsuite is > >>>>>>>>> +FAIL: gcc.target/i386/auto-init-4.c scan-assembler-times > >>>>>>>>> long\\t-16843010 5 > >>>>>>>>> because it decides to use a pattern that has non-zero bits in the > >>>>>>>>> padding > >>>>>>>>> bits of the long double, so the simplify-rtx.cc change prevents > >>>>>>>>> folding > >>>>>>>>> a SUBREG into a constant. We emit (the testcase is -O0 but we emit > >>>>>>>>> worse > >>>>>>>>> code at all opt levels) something like: > >>>>>>>>> movabsq $-72340172838076674, %rax > >>>>>>>>> movabsq $-72340172838076674, %rdx > >>>>>>>>> movq %rax, -48(%rbp) > >>>>>>>>> movq %rdx, -40(%rbp) > >>>>>>>>> fldt -48(%rbp) > >>>>>>>>> fstpt -32(%rbp) > >>>>>>>>> instead of > >>>>>>>>> fldt .LC2(%rip) > >>>>>>>>> fstpt -32(%rbp) > >>>>>>>>> ... > >>>>>>>>> .LC2: > >>>>>>>>> .long -16843010 > >>>>>>>>> .long -16843010 > >>>>>>>>> .long 65278 > >>>>>>>>> .long 0 > >>>>>>>>> Note, neither of those sequences actually stores the padding bits, > >>>>>>>>> fstpt > >>>>>>>>> simply doesn't touch them. > >>>>>>>>> For vars with clear_padding_real_needs_padding_p types that are > >>>>>>>>> allocated > >>>>>>>>> to memory at expansion time, I'd say much better would be to do the > >>>>>>>>> stores > >>>>>>>>> using integral modes rather than XFmode, so do that: > >>>>>>>>> movabsq $-72340172838076674, %rax > >>>>>>>>> movq %rax, -32(%rbp) > >>>>>>>>> movq %rax, -24(%rbp) > >>>>>>>>> directly. That is the only way to ensure the padding bits are > >>>>>>>>> initialized > >>>>>>>>> (or expand __builtin_clear_padding, but then you initialize > >>>>>>>>> separately the > >>>>>>>>> value bits and padding bits). > >>>>>>>>> > >>>>>>>>> Bootstrapped/regtested on x86_64-linux and i686-linux, though as > >>>>>>>>> mentioned > >>>>>>>>> above, the gcc.target/i386/auto-init-4.c case is unresolved. > >>>>>>>> > >>>>>>>> Thanks, I will try to fix this testing case in a later patch. > >>>>>>> > >>>>>>> I've looked at this FAIL now and really wonder whether "pattern init" > >>>>>>> as > >>>>>>> implemented makes any sense for non-integral types. > >>>>>>> We end up with > >>>>>>> initializing a register (SSA name) with > >>>>>>> > >>>>>>> VIEW_CONVERT_EXPR<long double>(0xfefefefefefefefefefefefefefefefe) > >>>>>>> > >>>>>>> as we go building a TImode constant (we verified we have a TImode > >>>>>>> SET!) > >>>>>>> but then > >>>>>>> > >>>>>>> /* Pun the LHS to make sure its type has constant size > >>>>>>> unless it is an SSA name where that's already known. */ > >>>>>>> if (TREE_CODE (lhs) != SSA_NAME) > >>>>>>> lhs = build1 (VIEW_CONVERT_EXPR, itype, lhs); > >>>>>>> else > >>>>>>> init = fold_build1 (VIEW_CONVERT_EXPR, TREE_TYPE (lhs), init); > >>>>>>> ... > >>>>>>> expand_assignment (lhs, init, false); > >>>>>>> > >>>>>>> and generally registers do not have any padding. This weird expansion > >>>>>>> then causes us to spill the TImode constant and reload the XFmode > >>>>>>> value, > >>>>>>> which is definitely not optimal here. > >>>>>>> > >>>>>>> One approach to avoid the worse code generation would be to use mode > >>>>>>> specific patterns for registers (like using a NaN or a target specific > >>>>>>> value that > >>>>>>> can be loaded cheaply), > >>>>>> > >>>>>> You mean that using “mode specific patterns” ONLY for registers? > >>>>>> Can we use “mode specific patterns” consistently for both registers > >>>>>> and memory? > >>>>> > >>>>> The difference is that registers don't have padding while memory > >>>>> possibly does, also > >>>>> for aggregates using different patterns is too expensive IMHO. For > >>>>> registers the extra > >>>>> complication with generic patterns is that efficient initialization > >>>>> (without going through memory) > >>>>> should be a priority IMHO. > >>>>> > >>>>> And for stack memory I still think that initializing the full > >>>>> allocated frame (including padding > >>>>> between variables!) is the best approach. > >>>>> > >>>>>> LLVM use different patterns for different types (Integral, Float, > >>>>>> pointer type, etc…) in order to > >>>>>> Catch bugs easily for different types. > >>>>>> > >>>>>> The beginning versions of this patch tried to do similar as LLVM, but > >>>>>> later we gave up this and > >>>>>> Use the same pattern for all different types. > >>>>>> > >>>>>> If now, we want to use “mode specific patterns” for registers, I am > >>>>>> wondering whether it’s > >>>>>> Better to just use “mode specific patterns” consistently for both > >>>>>> registers and memory? > >>>>>> > >>>>>>> another would be to simply fall back to zero > >>>>>>> initialization > >>>>>>> when we fail to constant fold the initializer like with > >>>>>>> > >>>>>>> diff --git a/gcc/internal-fn.cc b/gcc/internal-fn.cc > >>>>>>> index 8b1733e20c4..a4b995f71e4 100644 > >>>>>>> --- a/gcc/internal-fn.cc > >>>>>>> +++ b/gcc/internal-fn.cc > >>>>>>> @@ -3125,7 +3125,11 @@ expand_DEFERRED_INIT (internal_fn, gcall *stmt) > >>>>>>> if (TREE_CODE (lhs) != SSA_NAME) > >>>>>>> lhs = build1 (VIEW_CONVERT_EXPR, itype, lhs); > >>>>>>> else > >>>>>>> - init = fold_build1 (VIEW_CONVERT_EXPR, TREE_TYPE (lhs), > >>>>>>> init); > >>>>>>> + { > >>>>>>> + init = fold_build1 (VIEW_CONVERT_EXPR, TREE_TYPE (lhs), > >>>>>>> init); > >>>>>>> + if (!CONSTANT_CLASS_P (init)) > >>>>>>> + init = build_zero_cst (TREE_TYPE (lhs)); > >>>>>>> + } > >>>>>>> } > >>>>>>> else > >>>>>>> /* Use zero-init also for variable-length sizes. */ > >>>>>>> > >>>>>>> note that still does not address the issue noted by Jakub that we do > >>>>>>> not > >>>>>>> initialize padding this way - but of course that's because we expand a > >>>>>>> special assignment from .DEFERRED_INIT and are not initializing > >>>>>>> the storage allocated for the variable on the stack (at -O0) by RTL > >>>>>>> expansion. Ideally .DEFERRED_INIT "expansion" for stack variables > >>>>>>> would take place there, simply filling the allocated frame with the > >>>>>>> pattern or zero. That would probably mean that RTL expansion > >>>>>>> should scoop up .DEFERRED_INITs for variables it decides not to > >>>>>>> expand to pseudos and not leave that to stmt expansion. > >>>>>> > >>>>>> Need to study this a little bit more... > >>>> > >>>> > >>>> So, Is what you mean in the above a complete rewrite of the current > >>>> “expand_DEFERRED_INIT”: > >>>> > >>>> Instead of simply using “expand_builtin_memset” for “variables that are > >>>> not in register” and “expand_assignment” > >>>> for “variables that are in register”, RTL expansion should directly > >>>> expand this call in a lower level: > >>>> > >>>> i.e, > >>>> > >>>> tree lhs = gimple_call_lhs (stmt); > >>>> … > >>>> > >>>> rtx target = expand_expr (lhs, NULL_RTX, VOIDmode, EXPAND_WRITE); > >>>> > >>>> If (MEM_P (target)) // the variable is allocated on stack > >>>> { > >>>> // filling the allocated frame with pattern or zero. How to do it?? > >>>> } > >>>> else // the variable is in pseudo register. > >>>> { > >>>> rtx init_rtx = …; > >>>> emit_move_insn (target, init_rtx) > >>>> } > >>>> > >>>> > >>>> Is the above a correct understanding? > >>> > >>> No, I would simply expand the automatic in-memory var .DEFERRED_INIT > >>> calls to nothing > >>> but instead process their effect at the time we do > >>> expand_one_stack_var, which means at > >>> CFG expansion time scan for the .DEFERRED_INITs and record whether and > >>> how to > >>> initialize the variables. We can then group the variables accordingly > >>> and block-initialize > >>> the stack space also covering padding inbetween variables. > >> > >> So, the above only covers the variables that will be allocated in stack? > > > > Yes. > > > >> For the variables that are in pseudo registers, we still need to simply > >> expand the initialization to a move insn? > > > > Yes, also for stack VLAs. > > Why for stack VLAs?
VLAs are allocated with alloca which means the lifetime of VLAs might not be that of the whole function so we have to do it when the variable becomes live. Incidentially that also applies to variables we would be able to coalesce to the same stack slot, so we have to double-check that still works when attempting to block-initialize. > > > >> And for the variables in pseudo registers, later after register > >> allocation, some of them will be spilled into stack, too. > >> Specially for long double/complex double variables that might have > >> padding, shall we specially handle them during that time? > > > > Ideally we'd never spill uninitialized variables but definitely the > > auto-init can cause the need to spill that very > > value. That means that auto-init of registers should happen after RA? > > All the initialization of registers should happen after RA? Only the ones > with “long double/complex double” type that might have padding? Not sure, it needs some thoughts. Making sure all pseudos are initialized after RTL expansion would be another possibility. > > One would need to see what > > IRA/LRA do to uninitialized pseudo uses. Computing may uninit uses > > after RA should be possible and > > hopefully all ISAs allow initializing hard registers without requiring > > a scratch register (heh - I'm almost sure > > there will be exceptions …) > Need more study here…. Definitely. > Qing > > > > Richard. > > > >> Qing > >>> > >>> Richard. > >>> > >>>> Thanks. > >>>> > >>>> Qing > >>>> > >>>>>> > >>>>>>> > >>>>>>> I'm going to simply XFAIL this testcase for GCC 12 and opened > >>>>>>> PR105259 so we can revisit the above for GCC 13. > >>>>>> > >>>>>> I can further study this bug and try to come up with a good solution > >>>>>> in gcc13. > >>>>>> > >>>>>> Thank you! > >>>>>> > >>>>>> Qing > >>>>>>> > >>>>>>> Richard. > >>>>>>> > >>>>>>>> Qing >