https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82267
H.J. Lu <hjl.tools at gmail dot com> changed:
What |Removed |Added
----------------------------------------------------------------------------
CC|hjl at gcc dot gnu.org |hjl.tools at gmail dot com
--- Comment #2 from H.J. Lu <hjl.tools at gmail dot com> ---
(In reply to Peter Cordes from comment #0)
> x32 defaults to using 32-bit address-size everywhere, it seems. (Apparently
> introduced by rev 185396 for bug 50797, which introduced
> -maddress-mode=short and made it the default.)
>
> This takes an extra 1-byte prefix on every instruction with a memory
> operand. It's not just code-size; this is potentially a big throughput
> problem on Intel Silvermont where more than 3 prefixes (including mandatory
> prefixes and 0F escape bytes for SSE and other instructions) cause a stall.
> These are exactly the systems where a memory-saving ABI might be most
> useful. (I'm not building one, I just think x32 is a good idea if
> implemented optimally.)
>
> long long doublederef(long long **p){
> return **p;
> }
> // https://godbolt.org/g/NHbURq
> gcc8 -mx32 -O3
> movl (%edi), %eax # 0x67 prefix
> movq (%eax), %rax # 0x67 prefix
> ret
>
> The second instruction is 1 byte longer for no reason: it needs a 0x67
> address-size prefix to encode.
> But we know for certain that the address is already zero-extended into %rax,
> because we just put it there. Also, the ABI requires p to be zero-extended
> to 64 bits, so it would be safe to use `movl (%rdi), %eax` as the first
> instruction.
>
> Even (%rsp) is avoided for some reason, even though -mx32 still uses
> push/pop/call/ret which use the full %rsp, so it has to be valid.
>
> int stackuse(void) {
> volatile int foo = 2;
> return foo * 3;
> }
> movl $2, -4(%esp) # 0x67 prefix
> movl -4(%esp), %eax # 0x67 prefix
We can encode (%esp) as (%rsp) since the upper bits of RSP are zero.
> leal (%rax,%rax,2), %eax # no prefixes
> ret
>
>
> Compiling with -maddress-mode=long appears to generate optimal code for all
> the simple test cases I looked at, e.g.
>
> movl $2, -4(%rsp) # no prefixes
> movl -4(%rsp), %eax # no prefixes
> leal (%rax,%rax,2), %eax # no prefixes
> ret
>
> -maddress-mode=long still uses an address-size prefix instead of an LEA to
> make sure addresses wrap at 4G, and to ignore high garbage in registers:
>
> long long fooi(long long *arr, int offset){
> return arr[offset];
> }
> movq (%edi,%esi,8), %rax # same for mode=short or long.
> ret
>
> Are there still cases where -maddress-mode=long makes worse code?
Yes. There are more places where -maddress-mode=long has to zero-extend an
address to 64 bits explicitly, which the 0x67 prefix does for you.
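To illustrate the difference, here is a minimal C model of the address
arithmetic. It is only a sketch, not GCC code, and the names ea_addr32 and
ea_addr64 are made up for this example. With the 0x67 prefix the hardware
truncates the effective address to 32 bits and zero-extends it; without the
prefix the full 64-bit sum is used, so high garbage or a carry out of bit 31
has to be cleared by a separate instruction first.

```c
#include <stdint.h>

/* Model of base + index*scale with a 0x67 address-size prefix:
   the sum is truncated to 32 bits and then zero-extended. */
static uint64_t ea_addr32(uint64_t base, uint64_t index, unsigned scale)
{
    return (uint32_t)(base + index * scale);
}

/* Model of the same addressing mode without the prefix:
   the full 64-bit sum is used as the address. */
static uint64_t ea_addr64(uint64_t base, uint64_t index, unsigned scale)
{
    return base + index * scale;
}
```

For example, with base 0x1000, scale 8 and an index of 0xffffffff (a 32-bit -1
that was zero-extended), ea_addr32 gives 0xff8 while ea_addr64 gives
0x800000ff8, which is outside the x32 address space.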
> ----
>
> Is it really necessary for an unsigned offset to wrap at 4G? Does ISO C
> or GNU C guarantee that large unsigned values work like negative signed
> integers when used for pointer arithmetic?
>
> // 64-bit offset so it won't have high garbage
> long long fooull(long long *arr, unsigned long long offset){
> return arr[offset];
> }
>
> movq (%edi,%esi,8), %rax # but couldn't this be (%rdi,%rsi,8)
> ret
>
> Allowing 64-bit addressing modes with unsigned indexes could potentially
> save significant code-size, couldn't it?
>
> address-mode=long already allows constant offsets to go outside 4G, for
> example:
>
> foo_constant: # return arr[123456];
> movq 987648(%rdi), %rax
> ret
>
> But it does treat the offset as signed, so 0xffffffffULL compiles to movq
> -8(%rdi), %rax.
>
> The ABI doc (https://github.com/hjl-tools/x86-psABI/wiki/X86-psABI) doesn't
> specify anything about C pointer-wrapping semantics, and I don't know where
> else to look to find out what behaviour is required/guaranteed and what is
> just how the current implementation happens to work.
>
> Anyway, this is a side-track from the issue of not using address-size
> prefixes in single-pointer cases where it's already zero extended.
>
> ---------
>
> SSSE3 and later instructions need 66 0F 3A/38 before the opcode, so an
> address-size or REX prefix will cause a decode stall on Silvermont. With
That is true.
> the default x32 behaviour, even SSE2 instructions (66 0F opcode) will cause
> decode stalls with a REX and address-size prefix. e.g. paddb (%r8d), %xmm8
> or even movdqa (but not movaps or other SSE1 instructions). Fortunately KNL
> isn't really affected: VEX/EVEX is fine unless there's a segment prefix
> before it, but Agner Fog seems to be saying that other prefixes are fine.
>
> In integer code, REX + operand-size + address-size + a 0F escape byte would
> be a problem for Silvermont/KNL, e.g. imul (%edi), %r10w needs all 4.
> movbe %ax, (%edi) has 4 prefixes, including the 2 mandatory escape bytes: 67
> 66 0f 38 f1 07.
>
>
> In-order Atom also has "severe delays" (according to
> http://agner.org/optimize/) with more than 3 prefixes, but unlike
> Silvermont, that apparently doesn't include mandatory prefixes for SSE
> instructions. Similarly, Bulldozer-family has a 3-prefix limit, but doesn't
> count escape bytes, and VEX only counts as 0 or 1 (for 2/3 byte VEX).
But the 0x67 prefix is still better.
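For reference, the prefix counts discussed above can be checked with a small
C sketch. It assumes the Silvermont rule as quoted in comment #0 (legacy
prefixes, REX, and the 0F / 0F 38 / 0F 3A escape bytes all count), handles
only 64-bit-mode legacy encodings (no VEX/EVEX), and the helper names are
hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Operand-size, address-size, LOCK, REP/REPNE and segment-override prefixes. */
static int is_legacy_prefix(uint8_t b)
{
    switch (b) {
    case 0x66: case 0x67: case 0xF0: case 0xF2: case 0xF3:
    case 0x26: case 0x2E: case 0x36: case 0x3E: case 0x64: case 0x65:
        return 1;
    default:
        return 0;
    }
}

/* Count the bytes that precede the opcode byte proper: legacy prefixes,
   an optional REX prefix, and the 0F (+ 38 or 3A) escape bytes.
   Assumption: insn points at one well-formed non-VEX instruction. */
static int count_prefix_bytes(const uint8_t *insn, size_t len)
{
    size_t i = 0;
    int n = 0;

    while (i < len && is_legacy_prefix(insn[i])) { n++; i++; }
    if (i < len && (insn[i] & 0xF0) == 0x40) { n++; i++; }  /* REX */
    if (i < len && insn[i] == 0x0F) {
        n++; i++;
        if (i < len && (insn[i] == 0x38 || insn[i] == 0x3A)) { n++; i++; }
    }
    return n;
}
```

Fed the bytes 67 66 0f 38 f1 07 (movbe %ax, (%edi)) it returns 4, which is
over the 3-prefix limit; dropping the 0x67 byte brings it down to 3.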