Yes. You can thank Intel for this.
Thank you Intel :)
With the introduction of SSE1, something had to change in order
to satisfy hardware constraints. Intel initially proposed a
scheme that performed dynamic stack alignment in functions that
use SSE1 instructions, with multiple entry points to avoid
redundant realignments.
Ok I am no compiler expert, so this may be totally impossible,
and if so I'd appreciate an education, but this is what I
instinctively thought of when first thinking about this problem.
There are a very limited number of instructions that require
16-byte alignment. The two main places you have to worry about
that alignment are when passing arguments to a function and when
allocating local stack variables. I guess the compiler is safe
to assume that if you are using normal memory, say from a
malloc(), to hold the alignment-sensitive data, then you have
done your own alignment. So let's take the first case. You have some code
that is going to be passing some vector parameter, something
that is alignment-conscious. I am assuming the compiler knows
that. Before pushing the data onto the stack, the compiler
could arrange things such that those parameters would be
neatly aligned on a 16-byte boundary. The only assumption
that would need to hold true is that the called function was
also compiled with gcc. Using imaginary data types here, where
int128_t is alignment-sensitive, suppose we had:
int func (int128_t x, int y, int128_t z) {
  return 0;
}

int otherfunc (void) {
  int128_t foo = 123;
  int128_t bar = 234;
  return func(foo, 0, bar);
}
When generating the call to func(), could gcc not align the
stack to 16 bytes, such that the first argument is properly
aligned? You then push the next argument, a simple 4-byte int,
and then re-align to the next 16-byte boundary for the third
argument. The code in func() could take this scheme into
account, and would know the exact offsets into the stack it
needs to use to reach the args.
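To make those offsets concrete, here is a rough sketch (my own
illustration, not something gcc does today), using GCC's aligned
attribute to stand in for the imaginary int128_t. The struct
mirrors the padded argument block the caller would build on a
16-byte-aligned stack, so func() would find x at 0, y at 16, and
z re-aligned to 32:

#include <stdio.h>
#include <stddef.h>

/* Stand-in for the imaginary alignment-sensitive type. */
typedef struct { unsigned char b[16]; } int128_t __attribute__ ((aligned (16)));

/* The padded argument block: x at offset 0, y at 16, then 12 bytes
   of padding so z lands on the next 16-byte boundary at 32. */
struct call_frame {
    int128_t x;
    int      y;
    int128_t z;
};

int main (void) {
    printf ("x=%zu y=%zu z=%zu size=%zu\n",
            offsetof (struct call_frame, x),
            offsetof (struct call_frame, y),
            offsetof (struct call_frame, z),
            sizeof (struct call_frame));
    return 0;
}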
The second case is where you have function-local variables that
are alignment-constrained. In this case, wouldn't simple
analysis of the function contents determine whether any
alignment-sensitive insns are being used, and if so,
automatically align the stack to 16 bytes? That way, only those
functions that actually use such insns pay the (small) penalty
of rounding the stack up. Of course, all of that could be
conditional on targets that don't already guarantee 16-byte
alignment at function entry.
This seems far less invasive than redefining an ABI.
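At the source level, the kind of function I have in mind looks
something like the sketch below (again just an illustration; the
aligned attribute is a GCC extension standing in for a real
vector type). Only this function would need its frame rounded up
in the prologue; every other function is untouched:

#include <stdio.h>

/* Hypothetical stand-in for an alignment-sensitive vector type. */
typedef struct { float f[4]; } vec4 __attribute__ ((aligned (16)));

/* The 16-byte-aligned local is what would trigger the "round the
   stack up in the prologue" treatment, for this function only. */
static float sum4 (void) {
    vec4 v = { { 1.0f, 2.0f, 3.0f, 4.0f } };
    return v.f[0] + v.f[1] + v.f[2] + v.f[3];
}

int main (void) {
    printf ("%f\n", sum4 ());
    return 0;
}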
GCC code will interoperate with other compilers if you don't use
the 128-bit vector modes, but if you do, then we *require* that
the stack be maintained aligned.
I think, and I may be wrong here, that if I simply make sure
that entry to main is correctly aligned, then the majority of
code will just work. Assuming I am compiling
some program with gcc, if main is correctly aligned, and all
gcc code goes to lengths to ensure that alignment, then the
only time it can get *out* of alignment is if gcc code has
made a call to non-aligned libc code, which in turn makes
calls back into gcc code (a la qsort, ftw, etc). Those cases
are relatively rare. The only other time it's likely to be
an issue is with signal delivery, and I am pretty certain I
can persuade the kernel folks to ensure that the stack frame
is always aligned to 16 bytes when that happens.
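For what it's worth, the qsort/ftw-style callback case could
also be patched up at the source level. Here is a rough sketch
(my example, assuming GCC's x86-only force_align_arg_pointer
attribute is available): the one callback that can be reached
from non-aligned library code realigns its own frame on entry,
so nothing else pays anything.

#include <stdio.h>
#include <stdlib.h>

/* Comparison callback that may be invoked by a libc qsort that was
   not built to keep the stack 16-byte aligned.  The x86-specific
   force_align_arg_pointer attribute makes it realign its own frame
   on entry, so any alignment-sensitive locals it uses stay safe. */
static int __attribute__ ((force_align_arg_pointer))
cmp_int (const void *a, const void *b) {
    int x = *(const int *) a;
    int y = *(const int *) b;
    return (x > y) - (x < y);
}

int main (void) {
    int v[] = { 3, 1, 2 };
    qsort (v, sizeof v / sizeof v[0], sizeof v[0], cmp_int);
    printf ("%d %d %d\n", v[0], v[1], v[2]);
    return 0;
}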
Kean