On Fri, Jan 7, 2022 at 7:20 AM Alex Coplan via Gcc-patches
<gcc-patches@gcc.gnu.org> wrote:
>
> Hi Richard,
>
> Thanks for the review.
>
> On 20/12/2021 13:19, Richard Sandiford wrote:
> > Alex Coplan via Gcc-patches <gcc-patches@gcc.gnu.org> writes:
> > > Hi,
> > >
> > > This fixes PR103500 i.e. ensuring that stack slots for
> > > passed-by-reference overaligned types are appropriately aligned. For the
> > > testcase:
> > >
> > > typedef struct __attribute__((aligned(32))) {
> > >   long x,y;
> > > } S;
> > > S x;
> > > void f(S);
> > > void g(void) { f(x); }
> > >
> > > on AArch64, we currently generate (at -O2):
> > >
> > > g:
> > >         adrp    x1, .LANCHOR0
> > >         add     x1, x1, :lo12:.LANCHOR0
> > >         stp     x29, x30, [sp, -48]!
> > >         mov     x29, sp
> > >         ldp     q0, q1, [x1]
> > >         add     x0, sp, 16
> > >         stp     q0, q1, [sp, 16]
> > >         bl      f
> > >         ldp     x29, x30, [sp], 48
> > >         ret
> > >
> > > so the stack slot for the passed-by-reference copy of the structure is
> > > at sp + 16, and the sp is only guaranteed to be 16-byte aligned, so the
> > > structure is only 16-byte aligned. The PCS requires the structure to be
> > > 32-byte aligned. After this patch, we generate:
> > >
> > > g:
> > >         adrp    x1, .LANCHOR0
> > >         add     x1, x1, :lo12:.LANCHOR0
> > >         stp     x29, x30, [sp, -64]!
> > >         mov     x29, sp
> > >         add     x0, sp, 47
> > >         ldp     q0, q1, [x1]
> > >         and     x0, x0, -32
> > >         stp     q0, q1, [x0]
> > >         bl      f
> > >         ldp     x29, x30, [sp], 64
> > >         ret
> > >
> > > i.e. we ensure 32-byte alignment for the struct.
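> > >
> > > For reference, the add/and pair above is the usual round-up-and-mask
> > > idiom; in illustrative C (the names here are hypothetical, not from
> > > the patch), it computes:
> > >
> > >   uintptr_t base = sp + 16;                          /* start of the padded slot */
> > >   uintptr_t aligned = (base + 31) & ~(uintptr_t) 31; /* next 32-byte boundary */
> > >
> > > so the copy lands at the first 32-byte boundary at or above sp + 16.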
> > >
> > > The approach taken here is similar to that in
> > > function.c:assign_parm_setup_block where it handles the case for
> > > DECL_ALIGN (parm) > MAX_SUPPORTED_STACK_ALIGNMENT. This in turn is
> > > similar to the approach taken in cfgexpand.c:expand_stack_vars (where
> > > the function calls get_dynamic_stack_size) which is the code that
> > > handles the alignment for overaligned structures as addressable local
> > > variables (see the related case discussed in the PR).
> >
> > A difference with the latter is that cfgexpand (AFAICT) always
> > honours the DECL/TYPE_ALIGN, with LOCAL_DECL_ALIGNMENT supposedly
> > only increasing the alignment for efficiency reasons (rather than
> > decreasing it).
> >
> > So…
> >
> > > This patch also updates the aapcs64 test mentioned in the PR to avoid
> > > the frontend folding away the alignment check. I've confirmed that the
> > > execution test actually fails on aarch64-linux-gnu prior to the patch
> > > being applied and passes afterwards.
> > >
> > > Bootstrapped and regtested on aarch64-linux-gnu, x86_64-linux-gnu, and
> > > arm-linux-gnueabihf: no regressions.
> > >
> > > I'd appreciate any feedback. Is it OK for trunk?
> > >
> > > Thanks,
> > > Alex
> > >
> > > gcc/ChangeLog:
> > >
> > >     PR middle-end/103500
> > >     * function.c (get_stack_local_alignment): Align BLKmode overaligned
> > >     types to the alignment required by the type.
> > >     (assign_stack_temp_for_type): Handle BLKmode overaligned stack
> > >     slots by allocating a larger-than-necessary buffer and aligning
> > >     the address within appropriately.
> > >
> > > gcc/testsuite/ChangeLog:
> > >
> > >     PR middle-end/103500
> > >     * gcc.target/aarch64/aapcs64/rec_align-8.c (test_pass_by_ref):
> > >     Prevent the frontend from folding our alignment check away by
> > >     using snprintf to store the pointer into a string and recovering
> > >     it with sscanf.
> > >
> > > diff --git a/gcc/function.c b/gcc/function.c
> > > index 61b3bd036b8..5ed722ab959 100644
> > > --- a/gcc/function.c
> > > +++ b/gcc/function.c
> > > @@ -278,7 +278,9 @@ get_stack_local_alignment (tree type, machine_mode mode)
> > >    unsigned int alignment;
> > >
> > >    if (mode == BLKmode)
> > > -    alignment = BIGGEST_ALIGNMENT;
> > > +    alignment = (type && TYPE_ALIGN (type) > MAX_SUPPORTED_STACK_ALIGNMENT)
> > > +      ? TYPE_ALIGN (type)
> > > +      : BIGGEST_ALIGNMENT;
> >
> > …I'm not sure about this calculation.  Why do we only honour TYPE_ALIGN
> > if it's greater than MAX_SUPPORTED_STACK_ALIGNMENT, and fall all the
> > way back to BIGGEST_ALIGNMENT otherwise?  It looks like on nvptx
> > this would have the effect of honouring (say) 2048-byte alignment,
> > but not 32-byte alignment (which falls between BIGGEST_ALIGNMENT
> > and MAX_SUPPORTED_STACK_ALIGNMENT).
>
> So, to be honest, this was a bit of a bodge to try and work around
> issues on x86. My original attempt at solving the PR used the more
> obvious calculation you suggest below, i.e. the max of BIGGEST_ALIGNMENT
> and TYPE_ALIGN (type). The problem with that is, on x86,
> MAX_SUPPORTED_STACK_ALIGNMENT has a huge value (2^31).

On x86, the stack alignment limit is the limit of the stack itself.  You
can align the stack to any value on x86 as long as your stack allows it.

> explow.c:get_dynamic_stack_size has:
>
>   if (size_align % MAX_SUPPORTED_STACK_ALIGNMENT != 0)
>     {
>       size = round_push (size);
>       [...]
>     }
>   *psize = size;
>
> so inevitably we end up calling round_push on x86 which in turn ends up
> going down the else branch:
>
>   else
>     {
>       /* If crtl->preferred_stack_boundary might still grow, use
>          virtual_preferred_stack_boundary_rtx instead.  This will be
>          substituted by the right value in vregs pass and optimized
>          during combine.  */
>       align_rtx = virtual_preferred_stack_boundary_rtx;
>       alignm1_rtx = force_operand (plus_constant (Pmode, align_rtx, -1),
>                                    NULL_RTX);
>     }
>
> and we end up returning a size expression involving
> virtual_preferred_stack_boundary_rtx, which clearly cannot be a
> CONST_INT (required for the approach in the patch to work: we need a
> constant to pass to assign_stack_local_1).
>
> So the idea with that calculation was to only use the new, larger
> alignment, if it was larger than MAX_SUPPORTED_STACK_ALIGNMENT (which
> should never be the case on x86 for any reasonable value of TYPE_ALIGN).
> Otherwise, it would fall back to the old behaviour of just using
> BIGGEST_ALIGNMENT.
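>
> For concreteness, a minimal sketch of that first attempt (the plain
> max calculation; not what the posted patch does):
>
>   if (mode == BLKmode)
>     alignment = type ? MAX (BIGGEST_ALIGNMENT, TYPE_ALIGN (type))
>                      : BIGGEST_ALIGNMENT;
>
> On x86 that is what led to the round_push problem above.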
>
> I realise that this is a rather inelegant hack, but I wasn't sure how
> else to approach the problem.
>
> >
> > Also, what about the non-BLKmode case?  A type's mode cannot be more
> > aligned than the type, but a type is allowed to be more aligned than
> > its mode.  E.g.:
>
> Yes, we should probably try and handle the non-BLKmode case as well. I
> was just trying to avoid changing too much at once at the risk of
> introducing regressions.
>
> >
> > typedef unsigned int V __attribute__((vector_size(32)));
> > typedef struct __attribute__((aligned(32))) { V x; } S;
> > S x;
> > void f(S);
> > void g(void) { f(x); }
> >
> > doesn't seem to be handled by the patch.  It's possible to create
> > variations on this on aarch64 up to 2048 bits using SVE and
> > -msve-vector-bits=N.
> >
> > Perhaps we should treat the old alignment calculation as a minimum
> > and take the max of that and TYPE_ALIGN, where TYPE_ALIGN is given.
>
> As discussed above, this doesn't work on x86, unfortunately. If you have
> any alternative suggestions as to how to approach the problem, it would
> be great to hear them.
>
> Thanks,
> Alex
>
> >
> > >    else
> > >      alignment = GET_MODE_ALIGNMENT (mode);
> > >
> > > @@ -872,21 +874,35 @@ assign_stack_temp_for_type (machine_mode mode, poly_int64 size, tree type)
> > >
> > >        p = ggc_alloc<temp_slot> ();
> > >
> > > -      /* We are passing an explicit alignment request to assign_stack_local.
> > > -    One side effect of that is assign_stack_local will not round SIZE
> > > -    to ensure the frame offset remains suitably aligned.
> > > -
> > > -    So for requests which depended on the rounding of SIZE, we go ahead
> > > -    and round it now.  We also make sure ALIGNMENT is at least
> > > -    BIGGEST_ALIGNMENT.  */
> > > -      gcc_assert (mode != BLKmode || align == BIGGEST_ALIGNMENT);
> > > -      p->slot = assign_stack_local_1 (mode,
> > > -                                 (mode == BLKmode
> > > -                                  ? aligned_upper_bound (size,
> > > -                                                         (int) align
> > > -                                                         / BITS_PER_UNIT)
> > > -                                  : size),
> > > -                                 align, 0);
> > > +      if (mode == BLKmode && align > MAX_SUPPORTED_STACK_ALIGNMENT)
> > > +   {
> > > +     rtx allocsize = gen_int_mode (size, Pmode);
> > > +     get_dynamic_stack_size (&allocsize, 0, align, NULL);
> > > +     gcc_assert (CONST_INT_P (allocsize));
> > > +     size = UINTVAL (allocsize);
> >
> > Since the size coming in is a poly_int64, I think we should treat
> > the result in the same way, using rtx_to_poly_int64 rather than
> > requiring a CONST_INT.  We might not handle that case properly yet,
> > but rtx_to_poly_int64 would trap mistakes, and using it now would
> > mean fewer surprises later.
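> >
> > I.e. something like the following untested sketch in place of the
> > assert:
> >
> >   rtx allocsize = gen_int_mode (size, Pmode);
> >   get_dynamic_stack_size (&allocsize, 0, align, NULL);
> >   size = rtx_to_poly_int64 (allocsize);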
> >
> > Thanks,
> > Richard
> >
> > > +     p->slot = assign_stack_local_1 (mode,
> > > +                                     size,
> > > +                                     BIGGEST_ALIGNMENT, 0);
> > > +     rtx addr = align_dynamic_address (XEXP (p->slot, 0), align);
> > > +     mark_reg_pointer (addr, align);
> > > +     p->slot = gen_rtx_MEM (GET_MODE (p->slot), addr);
> > > +     MEM_NOTRAP_P (p->slot) = 1;
> > > +   }
> > > +      else
> > > +   /* We are passing an explicit alignment request to assign_stack_local.
> > > +      One side effect of that is assign_stack_local will not round SIZE
> > > +      to ensure the frame offset remains suitably aligned.
> > > +
> > > +      So for requests which depended on the rounding of SIZE, we go ahead
> > > +      and round it now.  We also make sure ALIGNMENT is at least
> > > +      BIGGEST_ALIGNMENT.  */
> > > +   p->slot = assign_stack_local_1 (mode,
> > > +                                   (mode == BLKmode
> > > +                                    ? aligned_upper_bound (size,
> > > +                                                           (int) align
> > > +                                                           / BITS_PER_UNIT)
> > > +                                    : size),
> > > +                                   align, 0);
> > >
> > >        p->align = align;
> > >
> > > diff --git a/gcc/testsuite/gcc.target/aarch64/aapcs64/rec_align-8.c b/gcc/testsuite/gcc.target/aarch64/aapcs64/rec_align-8.c
> > > index c9351802191..0b0ac0b9b97 100644
> > > --- a/gcc/testsuite/gcc.target/aarch64/aapcs64/rec_align-8.c
> > > +++ b/gcc/testsuite/gcc.target/aarch64/aapcs64/rec_align-8.c
> > > @@ -21,10 +21,18 @@ test_pass_by_ref (int x0, overaligned x1, int x2)
> > >      abort ();
> > >    if (memcmp ((void *) &x1, (void *)&a, sizeof (overaligned)))
> > >      abort ();
> > > -  long addr = ((long) &x1) & 31;
> > > +
> > > +  char buf[64];
> > > +  void *ptr;
> > > +
> > > +  __builtin_snprintf (buf, sizeof(buf), "%p", &x1);
> > > +  if (__builtin_sscanf(buf, "%p", &ptr) != 1)
> > > +    abort ();
> > > +
> > > +  long addr = ((long) ptr) & 31;
> > >    if (addr != 0)
> > >      {
> > > -      __builtin_printf ("Alignment was %d\n", addr);
> > > +      __builtin_printf ("Alignment was %ld\n", addr);
> > >        abort ();
> > >      }
> > >  }



-- 
H.J.