On 09/05/2017 03:48 PM, Segher Boessenkool wrote:
> Hi!
> 
> On Sat, Sep 02, 2017 at 12:31:16AM -0600, Jeff Law wrote:
>> On 08/29/2017 05:14 PM, Segher Boessenkool wrote:
>>> Actually, everywhere it is used it has a Pmode == SImode wart before
>>> it, to emit the right update_stack insn...  So fold that into this
>>> function, name it rs6000_emit_allocate_stack_1?
>> Agreed.  That seems like a nice cleanup.  Look at the call from
>> rs6000_emit_allocate_stack.  Instead of a Pmode == SImode, it tests
>> TARGET_32BIT instead.  But I think that one is still safe to convert
>> based on how we set Pmode in rs6000_option_override_internal.
> 
> TARGET_32BIT is exactly the same as Pmode == SImode.  Sometimes the
> latter is more expressive (if it describes more directly what is being
> tested).
Thanks for the confirmation.


> 
>>> You don't describe what the return value here is (but neither does
>>> rs6000_emit_allocate_stack); it is the single instruction that actually
>>> changes the stack pointer?  For the split stack support?  (Does the stack
>>> clash code actually work with split stack, has that been tested?)
>> As far as I was able to ascertain (and essentially duplicate) we return
>> the insn that allocates the stack, but only if the allocation was
>> handled a single insn.  Otherwise we return NULL.
>>
>> But that was determined largely by guesswork.  It interacts with split
>> stack support, but I'm not entirely what it's meant to do.  If you have
>> insights here, I'll happily add comments to the routines -- when I was
>> poking at this stuff I was rather distressed to not have any real
>> guidance on what the return value should be.
>>
>> I have not tested stack clash and split stacks. ISTM they ought to be
>> able to work together, but I haven't though real deep about it.
> 
> (To test, I think testing Go on 64-bit is the best you can do).
Yea.  I'll do that.

> 
> rs6000_emit_prologue has a comment:
> 
>   /* sp_adjust is the stack adjusting instruction, tracked so that the
>      insn setting up the split-stack arg pointer can be emitted just
>      prior to it, when r12 is not used here for other purposes.  */
Hmm, looking at all this again, it may need further refinement.  ISTM
that if we have multiple adjustments (say due to an unrolled
probe/allocate loop), we want to return the first adjustment insn.  For
a loop we likely want return something before the loop.

It's amazing how much clearer some things can be after a few weeks away
and someone proving some guidance.

> 
>>> The way more common case is no probes at all, but that is handled in the
>>> caller; maybe move it to this function instead?
>> The comment is correct for what's handled by that code.  As you note the
>> most common case is no probing needed and in those cases we never call
>> rs6000_emit_probe_stack_range_stack_clash.
>>
>> The code was written that way so that the common case (no explicit
>> probes) would just fall into the old code.  That minimized the
>> possibility that I mucked anything up :-)
> 
> I think it's cleaner if you don't handle that check externally, but
> you decide.
It seems like it ought to be cleaner, but I don't think it is in reality.

In some ways the PPC is a bit different in that probing happens as a
side effect of allocation.  So we're always probing when we allocate
stack.    What the test really does is determine if we're breaking a
single allocation/probe into smaller allocations/probes or not.

I wonder if a rename here would be useful to reflect that reality for
PPC.  I also hate typing that overly long function name :-)  A rename
might get it down to something reasonable.



> 
>>> It seems the only thing preventing the probes from being elided is that
>>> GCC does not realise the probe stores are not actually needed?  I guess
>>> that should be enough, but maybe there should be a test for this.
>> I'm not sure what you're asking here.  In the code above we're
>> allocating a multiple of PROBE_INTERVAL bytes.  We must probe every
>> PROBE_INTERVAL bytes.
> 
> Yes, but what prevents later passes from getting rid of the stores?
> Maybe some volatile is needed?  The stack pointer updates cannot easily
> be removed by anything, but relying on that to also protect the stores
> is a bit fragile.
Ah.  Now I understand what you're asking.

We've been using a mixture of a magic note to prevent the stack
adjustments from being combined as well as blockage insns to prevent
code motion and other transformations.

The former came first, but I think the latter should eliminate the need
for the magic notes.  Let me do some testing around that.

Assuming I'm right we just need to make sure PPC emits the right
blockage insns.  While it's not really subject to the kinds of
optimizations we see on x86/aarch64, better safe than sorry.

> 
>>>> +  {
>>>> +    emit_move_insn (end_addr, GEN_INT (-rounded_size));
>>>> +    insn = emit_insn (gen_add3_insn (end_addr, end_addr,
>>>> +                                     stack_pointer_rtx));
>>>> +    add_reg_note (insn, REG_FRAME_RELATED_EXPR,
>>>> +                  gen_rtx_SET (end_addr,
>>>> +                               gen_rtx_PLUS (Pmode, stack_pointer_rtx,
>>>> +                                             rs)));
>>>> +  }
>>>> +      RTX_FRAME_RELATED_P (insn) = 1;
>>>
>>> What are these notes for?  Not totally obvious...  could use a comment :-)
>> At a high level we're describing to the CFI machinery how to compute the
>> CFA.
>>
>> ie we have
>>
>> (set (endreg) (-rounded_size))
>> (set (endreg) (endreg) (stack_pointer_rtx))
>>
>> But only the second insn actually participates in the CFA computation
>> because it's the only one marked as RTX_FRAME_RELATED.  But the CFI
>> machinery isn't going to be able to figure out what that insn does.  So
>> we describe it with a note.  Think of it as a REG_EQUAL note for the CFI
>> machinery.
> 
> Is end_addr used for calculating the CFA in any way?  That's what I'm
> not seeing.
It is on the other architectures.  It's invariant across the loop so we
can compute the CFA based on the end address - size to allocate.  That
way we're not dependent on the stack pointer which is varying within the
loop.

PPC does things slightly differently.  I'm trusting Jakub's knowledge of
the dwarf2cfi bits :-)

I tried to follow the logic in dwarf2cfi at one time, but didn't get
far.  The docs are unclear as well.


>>>> @@ -10272,6 +10272,15 @@
>>>>    rtx neg_op0;
>>>>    rtx insn, par, set, mem;
>>>>  
>>>> +  /* By allowing reg_or_cint_operand as the predicate we can get
>>>> +     better code for stack-clash-protection because we do not lose
>>>> +     size information.  But the rest of the code expects the operand
>>>> +     to be reg_or_short_operand.  If it isn't, then force it into
>>>> +     a register.  */
>>>> +  rtx orig_op1 = operands[1];
>>>> +  if (!reg_or_short_operand (operands[1], Pmode))
>>>> +    operands[1] = force_reg (Pmode, operands[1]);
>>>
>>> I don't understand the "we do not lose size information" bit.  Maybe
>>> you can show an example?
>> If you have a large constant alloca -- large enough that the size does
>> not fit in a reg_or_cint_operand, then the generic code in explow.c will
>> force the size into a register before calling this expander.
>>
>> Thus we don't know the allocation is constant within this expander which
>> limits our ability to rotate the loop, unroll the loop and/or elide any
>> residual allocations.
> 
> Ah, I see.  Could you put that tidbit in the comment please?  Something
> like "If we changed the operands[1] contraint to reg_or_short_operand
> we would not notice in here we are asked for a constant size allocation
> if it does not fit 16 bits (the caller would force the constant into a
> register)."
Will do.

> 
>>>> +      /* We do occasionally get in here with constant sizes, we might
>>>> +   as well do a reasonable job when we obviously can.  */
>>>> +      if (rounded_size != CONST0_RTX (Pmode))
>>>
>>> That's just const0_rtx.
>> The canonical way is to test against CONST0_RTX (mode).  Testing
>> "const0_rtx" may be safe here, but in general the right way is
>> CONST0_RTX (mode).
> 
> But mode is an integer mode here, so const0_rtx is fine (and shorter
> and easier to read, and more usual).
Been burned enough that I always write it as CONST0_RTX (mode).  In the
case of an integer mode I guess it doesn't really matter.  I'll change it.

jeff

Reply via email to