Re: [PATCH] Fix alignment of stack slots for overaligned types [PR103500]

Alex Coplan via Gcc-patches Fri, 07 Jan 2022 07:20:07 -0800

Hi Richard,

Thanks for the review.


On 20/12/2021 13:19, Richard Sandiford wrote:
> Alex Coplan via Gcc-patches <gcc-patches@gcc.gnu.org> writes:
> > Hi,
> >
> > This fixes PR103500 i.e. ensuring that stack slots for
> > passed-by-reference overaligned types are appropriately aligned. For the
> > testcase:
> >
> > typedef struct __attribute__((aligned(32))) {
> >   long x,y;
> > } S;
> > S x;
> > void f(S);
> > void g(void) { f(x); }
> >
> > on AArch64, we currently generate (at -O2):
> >
> > g:
> >         adrp    x1, .LANCHOR0
> >         add     x1, x1, :lo12:.LANCHOR0
> >         stp     x29, x30, [sp, -48]!
> >         mov     x29, sp
> >         ldp     q0, q1, [x1]
> >         add     x0, sp, 16
> >         stp     q0, q1, [sp, 16]
> >         bl      f
> >         ldp     x29, x30, [sp], 48
> >         ret
> >
> > so the stack slot for the passed-by-reference copy of the structure is
> > at sp + 16, and the sp is only guaranteed to be 16-byte aligned, so the
> > structure is only 16-byte aligned. The PCS requires the structure to be
> > 32-byte aligned. After this patch, we generate:
> >
> > g:
> >         adrp    x1, .LANCHOR0
> >         add     x1, x1, :lo12:.LANCHOR0
> >         stp     x29, x30, [sp, -64]!
> >         mov     x29, sp
> >         add     x0, sp, 47
> >         ldp     q0, q1, [x1]
> >         and     x0, x0, -32
> >         stp     q0, q1, [x0]
> >         bl      f
> >         ldp     x29, x30, [sp], 64
> >         ret
> >
> > i.e. we ensure 32-byte alignment for the struct.
> >
> > The approach taken here is similar to that in
> > function.c:assign_parm_setup_block where it handles the case for
> > DECL_ALIGN (parm) > MAX_SUPPORTED_STACK_ALIGNMENT. This in turn is
> > similar to the approach taken in cfgexpand.c:expand_stack_vars (where
> > the function calls get_dynamic_stack_size) which is the code that
> > handles the alignment for overaligned structures as addressable local
> > variables (see the related case discussed in the PR).
> 
> A difference with the latter is that cfgexpand (AFAICT) always
> honours the DECL/TYPE_ALIGN, with LOCAL_DECL_ALIGNMENT supposedly
> only increasing the alignment for efficiency reasons (rather than
> decreasing it).
> 
> So…
> 
> > This patch also updates the aapcs64 test mentioned in the PR to avoid
> > the frontend folding away the alignment check. I've confirmed that the
> > execution test actually fails on aarch64-linux-gnu prior to the patch
> > being applied and passes afterwards.
> >
> > Bootstrapped and regtested on aarch64-linux-gnu, x86_64-linux-gnu, and
> > arm-linux-gnueabihf: no regressions.
> >
> > I'd appreciate any feedback. Is it OK for trunk?
> >
> > Thanks,
> > Alex
> >
> > gcc/ChangeLog:
> >
> >     PR middle-end/103500
> >     * function.c (get_stack_local_alignment): Align BLKmode overaligned
> >     types to the alignment required by the type.
> >     (assign_stack_temp_for_type): Handle BLKmode overaligned stack
> >     slots by allocating a larger-than-necessary buffer and aligning
> >     the address within appropriately.
> >
> > gcc/testsuite/ChangeLog:
> >
> >     PR middle-end/103500
> >     * gcc.target/aarch64/aapcs64/rec_align-8.c (test_pass_by_ref):
> >     Prevent the frontend from folding our alignment check away by
> >     using snprintf to store the pointer into a string and recovering
> >     it with sscanf.
> >
> > diff --git a/gcc/function.c b/gcc/function.c
> > index 61b3bd036b8..5ed722ab959 100644
> > --- a/gcc/function.c
> > +++ b/gcc/function.c
> > @@ -278,7 +278,9 @@ get_stack_local_alignment (tree type, machine_mode mode)
> >    unsigned int alignment;
> >  
> >    if (mode == BLKmode)
> > -    alignment = BIGGEST_ALIGNMENT;
> > +    alignment = (type && TYPE_ALIGN (type) > MAX_SUPPORTED_STACK_ALIGNMENT)
> > +      ? TYPE_ALIGN (type)
> > +      : BIGGEST_ALIGNMENT;
> 
> …I'm not sure about this calculation.  Why do we only honour TYPE_ALIGN
> if it's greater than MAX_SUPPORTED_STACK_ALIGNMENT, and fall all the
> way back to BIGGEST_ALIGNMENT otherwise?  It looks like on nvptx
> this would have the effect of honouring (say) 2048-byte alignment,
> but not 32-byte alignment (which falls between BIGGEST_ALIGNMENT
> and MAX_SUPPORTED_STACK_ALIGNMENT).

So, to be honest, this was a bit of a bodge to try and work around
issues on x86. My original attempt at solving the PR used the more
obvious calculation you suggest below, i.e. the max of BIGGEST_ALIGNMENT
and TYPE_ALIGN (type). The problem with that is, on x86,
MAX_SUPPORTED_STACK_ALIGNMENT has a huge value (2^31).
explow.c:get_dynamic_stack_size has:

  if (size_align % MAX_SUPPORTED_STACK_ALIGNMENT != 0)
    {
      size = round_push (size);
      [...]
    }
  *psize = size;

so inevitably we end up calling round_push on x86 which in turn ends up
going down the else branch:

  else
    {
      /* If crtl->preferred_stack_boundary might still grow, use
         virtual_preferred_stack_boundary_rtx instead.  This will be
         substituted by the right value in vregs pass and optimized
         during combine.  */
      align_rtx = virtual_preferred_stack_boundary_rtx;
      alignm1_rtx = force_operand (plus_constant (Pmode, align_rtx, -1),
                                   NULL_RTX);
    }

and we end up returning a size expression involving
virtual_preferred_stack_boundary_rtx, which clearly cannot be a
CONST_INT (required for the approach in the patch to work: we need a
constant to pass to assign_stack_local_1).

So the idea with that calculation was to only use the new, larger
alignment, if it was larger than MAX_SUPPORTED_STACK_ALIGNMENT (which
should never be the case on x86 for any reasonable value of TYPE_ALIGN).
Otherwise, it would fall back to the old behaviour of just using
BIGGEST_ALIGNMENT.

I realise that this is a rather inelegant hack, but I wasn't sure how
else to approach the problem.

> 
> Also, what about the non-BLKmode case?  A type's mode cannot be more
> aligned than the type, but a type is allowed to be more aligned than
> its mode.  E.g.:

Yes, we should probably try and handle the non-BLKmode case as well. I
was just trying to avoid changing too much at once at the risk of
introducing regressions.

> 
> typedef unsigned int V __attribute__((vector_size(32)));
> typedef struct __attribute__((aligned(32))) { V x; } S;
> S x;
> void f(S);
> void g(void) { f(x); }
> 
> doesn't seem to be handled by the patch.  It's possible to create
> variations on this on aarch64 up to 2048 bits using SVE and
> -msve-vector-bits=N.
> 
> Perhaps we should treat the old alignment calculation as a minimum
> and take the max of that and TYPE_ALIGN, where TYPE_ALIGN is given.

As discussed above, this doesn't work on x86, unfortunately. If you have
any alternative suggestions as to how to approach the problem, it would
be great to hear them.

Thanks,
Alex

> 
> >    else
> >      alignment = GET_MODE_ALIGNMENT (mode);
> >  
> > @@ -872,21 +874,35 @@ assign_stack_temp_for_type (machine_mode mode, 
> > poly_int64 size, tree type)
> >  
> >        p = ggc_alloc<temp_slot> ();
> >  
> > -      /* We are passing an explicit alignment request to 
> > assign_stack_local.
> > -    One side effect of that is assign_stack_local will not round SIZE
> > -    to ensure the frame offset remains suitably aligned.
> > -
> > -    So for requests which depended on the rounding of SIZE, we go ahead
> > -    and round it now.  We also make sure ALIGNMENT is at least
> > -    BIGGEST_ALIGNMENT.  */
> > -      gcc_assert (mode != BLKmode || align == BIGGEST_ALIGNMENT);
> > -      p->slot = assign_stack_local_1 (mode,
> > -                                 (mode == BLKmode
> > -                                  ? aligned_upper_bound (size,
> > -                                                         (int) align
> > -                                                         / BITS_PER_UNIT)
> > -                                  : size),
> > -                                 align, 0);
> > +      if (mode == BLKmode && align > MAX_SUPPORTED_STACK_ALIGNMENT)
> > +   {
> > +     rtx allocsize = gen_int_mode (size, Pmode);
> > +     get_dynamic_stack_size (&allocsize, 0, align, NULL);
> > +     gcc_assert (CONST_INT_P (allocsize));
> > +     size = UINTVAL (allocsize);
> 
> Since the size coming in is a poly_int64, I think we should treat
> the result in the same way, using rtx_to_poly_int64 rather than
> requiring a CONST_INT.  We might not handle that case properly yet,
> but rtx_to_poly_int64 would trap mistakes, and using it now would
> avoid fewer surprises later.
> 
> Thanks,
> Richard
> 
> > +     p->slot = assign_stack_local_1 (mode,
> > +                                     size,
> > +                                     BIGGEST_ALIGNMENT, 0);
> > +     rtx addr = align_dynamic_address (XEXP (p->slot, 0), align);
> > +     mark_reg_pointer (addr, align);
> > +     p->slot = gen_rtx_MEM (GET_MODE (p->slot), addr);
> > +     MEM_NOTRAP_P (p->slot) = 1;
> > +   }
> > +      else
> > +   /* We are passing an explicit alignment request to assign_stack_local.
> > +      One side effect of that is assign_stack_local will not round SIZE
> > +      to ensure the frame offset remains suitably aligned.
> > +
> > +      So for requests which depended on the rounding of SIZE, we go ahead
> > +      and round it now.  We also make sure ALIGNMENT is at least
> > +      BIGGEST_ALIGNMENT.  */
> > +   p->slot = assign_stack_local_1 (mode,
> > +                                   (mode == BLKmode
> > +                                    ? aligned_upper_bound (size,
> > +                                                           (int) align
> > +                                                           / BITS_PER_UNIT)
> > +                                    : size),
> > +                                   align, 0);
> >  
> >        p->align = align;
> >  
> > diff --git a/gcc/testsuite/gcc.target/aarch64/aapcs64/rec_align-8.c 
> > b/gcc/testsuite/gcc.target/aarch64/aapcs64/rec_align-8.c
> > index c9351802191..0b0ac0b9b97 100644
> > --- a/gcc/testsuite/gcc.target/aarch64/aapcs64/rec_align-8.c
> > +++ b/gcc/testsuite/gcc.target/aarch64/aapcs64/rec_align-8.c
> > @@ -21,10 +21,18 @@ test_pass_by_ref (int x0, overaligned x1, int x2)
> >      abort ();
> >    if (memcmp ((void *) &x1, (void *)&a, sizeof (overaligned)))
> >      abort ();
> > -  long addr = ((long) &x1) & 31;
> > +
> > +  char buf[64];
> > +  void *ptr;
> > +
> > +  __builtin_snprintf (buf, sizeof(buf), "%p", &x1);
> > +  if (__builtin_sscanf(buf, "%p", &ptr) != 1)
> > +    abort ();
> > +
> > +  long addr = ((long) ptr) & 31;
> >    if (addr != 0)
> >      {
> > -      __builtin_printf ("Alignment was %d\n", addr);
> > +      __builtin_printf ("Alignment was %ld\n", addr);
> >        abort ();
> >      }
> >  }

Re: [PATCH] Fix alignment of stack slots for overaligned types [PR103500]

Reply via email to