Hi all,

The problem described here probably only affects targets whose ABI allows
structured arguments of a certain size to be passed in registers.

If the mode of the parameter type is BLKmode, then during RTL expansion of
the callee a stack slot is reserved for the parameter, and the incoming
value is copied into that slot.

However, the stack slot for the parameter will not be aligned if the
alignment of the parameter type exceeds MAX_SUPPORTED_STACK_ALIGNMENT.
An unaligned memory access through that slot might then cause run-time
errors.

For local variables on the stack, the alignment of the data type is honored,
although the documentation states that this is not guaranteed.
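
One way to observe the difference at run time is a small check like the
one below. This is only a sketch: the type `union A16` and the helpers
`is_aligned` and `param_is_aligned` are hypothetical names made up for
illustration, and the volatile copy is an assumption about what is needed
to stop the compiler from folding the check based on the declared alignment.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical over-aligned type, similar in spirit to the parameter
   types discussed in this mail. */
union A16 {
    uint32_t w;
} __attribute__((aligned(16)));

/* Return 1 if p is a multiple of align (align must be a power of two). */
static int is_aligned(const void *p, uintptr_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    /* Read the pointer back through a volatile object so the compiler
       cannot assume the result from the type's declared alignment. */
    const void *volatile q = p;
    return ((uintptr_t)q & (align - 1)) == 0;
}

/* noinline so that the parameter is really materialized in the callee. */
__attribute__((noinline))
static int param_is_aligned(union A16 p)
{
    return is_aligned(&p, 16);
}
```

Calling `param_is_aligned` with any `union A16` value returns 1 when the
callee's copy of the parameter is 16-byte aligned and 0 otherwise. The
result depends on the target and ABI.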

For example:

#include <stdint.h>
union U {
    uint32_t M0;
    uint32_t M1;
    uint32_t M2;
    uint32_t M3;
} __attribute((aligned(16)));

void tmp (union U *);
void foo (union U P0)
{
  union U P1 = P0;
  tmp (&P1);
}

The code generated for armv7-a looks like this:

foo:
    @ args = 0, pretend = 0, frame = 48
    @ frame_needed = 0, uses_anonymous_args = 0
    str    lr, [sp, #-4]!
    sub    sp, sp, #52
    mov    ip, sp
    stm    ip, {r0, r1, r2, r3}  --> ip is not 128-bit aligned
    add    lr, sp, #39
    bic    lr, lr, #15
    ldm    ip, {r0, r1, r2, r3}
    stm    lr, {r0, r1, r2, r3} --> lr is 128-bit aligned
    mov    r0, lr
    bl    tmp
    add    sp, sp, #52
    @ sp needed
    ldr    pc, [sp], #4

There are other obvious missed optimizations in the generated code above.
The stack slots for parameter P0 and local variable P1 could be merged,
so that some of the load/store instructions could be removed.
I think this is a known missed-optimization case.

To summarize, there are two issues here:
1. (wrong code) An unaligned stack slot is allocated for parameters during
   function expansion.
2. (missed optimization) The stack slot for a parameter is sometimes
   unnecessary. In certain scenarios, the argument register could be used
   directly. Currently, this is only possible when the parameter mode is
   not BLKmode.

For issue 1, we could do something similar to expand_used_vars:
dynamically align the stack slot address for parameters whose alignment
exceeds PREFERRED_STACK_BOUNDARY. Other parameters could be stored in the
gap between the aligned address and fp when possible.
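
The dynamic alignment meant here is the usual round-up of an address inside
an over-allocated area. In the listing above, `add lr, sp, #39` followed by
`bic lr, lr, #15` computes the same value as rounding sp + 24 up to a
multiple of 16. Below is a minimal sketch of that arithmetic in C. The
helper names `align_up` and `aligned_slot` are hypothetical and are not gcc
internals.

```c
#include <assert.h>
#include <stdint.h>

/* Round addr up to the next multiple of align (a power of two). */
static uintptr_t align_up(uintptr_t addr, uintptr_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return (addr + align - 1) & ~(align - 1);
}

/* Given a raw area that was over-allocated by at least align - 1 bytes,
   return the first suitably aligned address inside it. */
static void *aligned_slot(unsigned char *raw, uintptr_t align)
{
    return (void *)align_up((uintptr_t)raw, align);
}
```

The cost is up to align - 1 bytes of padding per slot, which is the gap
that other parameters might reuse.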

For issue 2, I checked the behavior of LLVM. It seems the stack slot
allocation for parameters is explicitly exposed by alloca IR instructions
at the very beginning. Later, optimization/transformation passes such as
mem2reg and sroa remove the unnecessary alloca instructions.

In gcc, the stack allocation for parameters and local variables is done
implicitly during the expand pass, and the RTL passes are not able to
remove the unnecessary stack allocation and load/store operations.

For example:

uint32_t bar(union U P0)
{
  return P0.M0;
}

Currently, the generated code differs between targets.
There are various backend hooks which make the code generation sub-optimal.
For example, the aarch64 target can return directly with w0, while the
armv7-a target generates an unnecessary store and load.

However, this optimization should be target independent and unrelated to
the target's alignment configuration.
Both issues 1 and 2 could be resolved if gcc took a similar approach, but
I assume the change would be big.

Are there any suggestions for solving issue 1 and improving issue 2 in a
generic way?
I can create a Bugzilla ticket to record the issue.

Regards,
Renlin
