[Bug target/82267] x32: unnecessary address-size prefixes. Why isn't -maddress-mode=long the default?

hjl.tools at gmail dot com Fri, 22 Sep 2017 16:41:02 -0700

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82267


H.J. Lu <hjl.tools at gmail dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|hjl at gcc dot gnu.org      |hjl.tools at gmail dot com

--- Comment #2 from H.J. Lu <hjl.tools at gmail dot com> ---
(In reply to Peter Cordes from comment #0)
> x32 defaults to using 32-bit address-size everywhere, it seems.  (Apparently
> introduced by rev 185396 for bug 50797, which introduced
> -maddress-mode=short and made it the default.)
> 
> This takes an extra 1-byte prefix on every instruction with a memory
> operand.  It's not just code-size; this is potentially a big throughput
> problem on Intel Silvermont where more than 3 prefixes (including mandatory
> prefixes and 0F escape bytes for SSE and other instructions) cause a stall. 
> These are exactly the systems where a memory-saving ABI might be most
> useful.  (I'm not building one, I just think x32 is a good idea if
> implemented optimally.)
> 
> long long doublederef(long long **p){
>         return **p;
> }
> //  https://godbolt.org/g/NHbURq
> gcc8 -mx32 -O3
>         movl    (%edi), %eax          # 0x67 prefix
>         movq    (%eax), %rax          # 0x67 prefix
>         ret
> 
> The second instruction is 1 byte longer for no reason: it needs a 0x67
> address-size prefix to encode.
> But we know for certain that the address is already zero-extended into %rax,
> because we just put it there.  Also, the ABI requires p to be zero-extended
> to 64 bits, so it would be safe to use `movl (%rdi), %eax` as the first
> instruction.
> 
> Even (%rsp) is avoided for some reason, even though -mx32 still uses
> push/pop/call/ret which use the full %rsp, so it has to be valid.
> 
> int stackuse(void) {
>         volatile int foo = 2;
>         return foo * 3;
> }
>         movl    $2, -4(%esp)            # 0x67 prefix
>         movl    -4(%esp), %eax          # 0x67 prefix

We can encode (%esp) as (%rsp) since the upper bits of RSP are zero.

>         leal    (%rax,%rax,2), %eax     # no prefixes
>         ret
> 
> 
> Compiling with -maddress-mode=long appears to generate optimal code for all
> the simple test cases I looked at, e.g.
> 
>         movl    $2, -4(%rsp)            # no prefixes
>         movl    -4(%rsp), %eax          # no prefixes
>         leal    (%rax,%rax,2), %eax     # no prefixes
>         ret
> 
> -maddress-mode=long still uses an address-size prefix instead of an LEA to
> make sure addresses wrap at 4G, and to ignore high garbage in registers:
> 
> long long fooi(long long *arr, int offset){
>         return arr[offset];
> }
>         movq    (%edi,%esi,8), %rax    # same for mode=short or long.
>         ret
> 
> Are there still cases where -maddress-mode=long makes worse code?


Yes, there are more places where -maddress-mode=long needs to zero-extend
the address to 64 bits, which the 0x67 prefix does for you.
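
A rough sketch of why, written as a C model of the address computation
rather than as anything taken from GCC. The helper names ea_addr32 and
ea_addr64 are made up for illustration. With the 0x67 prefix the CPU
computes base + index*scale + disp modulo 2^32 and zero-extends the result;
without it the sum is 64 bits wide, so the compiler must first prove, or
force with an extra instruction, that the sum fits in 32 bits.

```c
#include <stdint.h>

/* Model of an addressing mode with the 0x67 address-size prefix in
   64-bit mode: the sum is truncated to 32 bits and zero-extended.
   (Hypothetical helper, for illustration only.) */
static uint64_t ea_addr32(uint64_t base, uint64_t index,
                          unsigned scale, int32_t disp)
{
    uint64_t sum = base + index * scale + (uint64_t)(int64_t)disp;
    return (uint32_t)sum;   /* wrap at 4G, upper 32 bits become zero */
}

/* Model of the same addressing mode without the prefix: the full
   64-bit sum is used as the address. */
static uint64_t ea_addr64(uint64_t base, uint64_t index,
                          unsigned scale, int32_t disp)
{
    return base + index * scale + (uint64_t)(int64_t)disp;
}
```

For arr at 0x1000 and a 32-bit index of -1 sitting zero-extended in the
register (0xffffffff), scale 8: ea_addr32 gives 0xff8, which is &arr[-1],
while ea_addr64 gives 0x800000ff8, which is outside the x32 address space.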

> ----
> 
> Is it really necessary for an unsigned offset to wrap at 4G?  Does ISO C
> or GNU C guarantee that large unsigned values work like negative signed
> integers when used for pointer arithmetic?
> 
> // 64-bit offset so it won't have high garbage
> long long fooull(long long *arr, unsigned long long offset){
>         return arr[offset];
> }
> 
>         movq    (%edi,%esi,8), %rax    # but couldn't this be (%rdi,%rsi,8)
>         ret
> 
> Allowing 64-bit addressing modes with unsigned indexes could potentially
> save significant code-size, couldn't it?
> 
> address-mode=long already allows constant offsets to go outside 4G, for
> example:
> 
> foo_constant:         #    return arr[123456];
>         movq    987648(%rdi), %rax
>         ret
> 
> But it does treat the offset as signed, so 0xffffffffULL will  movq
> -8(%rdi), %rax.
> 
> The ABI doc (https://github.com/hjl-tools/x86-psABI/wiki/X86-psABI) doesn't
> specify anything about C pointer-wrapping semantics, and I don't know where
> else to look to find out what behaviour is required/guaranteed and what is
> just how the current implementation happens to work.
> 
> Anyway, this is a side-track from the issue of not using address-size
> prefixes in single-pointer cases where it's already zero extended.
> 
> ---------
> 
> SSSE3 and later instructions need 66 0F 3A/38 before the opcode, so an
> address-size or REX prefix will cause a decode stall on Silvermont.  With

That is true.

> the default x32 behaviour, even SSE2 instructions (66 0F opcode) will cause
> decode stalls with a REX and address-size prefix.  e.g. paddb (%r8d), %xmm8 
> or even movdqa (but not movaps or other SSE1 instructions).  Fortunately KNL
> isn't really affected: VEX/EVEX is fine unless there's a segment prefix
> before it, but Agner Fog seems to be saying that other prefixes are fine.
> 
> In integer code, REX + operand-size + address-size + a 0F escape byte would
> be a problem for Silvermont/KNL, e.g. imul (%edi), %r10w needs all 4.  
> movbe %ax, (%edi) has 4 prefixes, including the 2 mandatory escape bytes: 67
> 66 0f 38 f1 07.
> 
> 
> In-order Atom also has "severe delays" (according to
> http://agner.org/optimize/) with more than 3 prefixes, but unlike
> Silvermont, that apparently doesn't include mandatory prefixes for SSE
> instructions.  Similarly, Bulldozer-family has a 3-prefix limit, but doesn't
> count escape bytes, and VEX only counts as 0 or 1 (for 2/3 byte VEX).

But the 0x67 prefix is still better.
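
For checking individual encodings against the Silvermont rule quoted above
(more than 3 prefixes, counting mandatory prefixes and escape bytes), a
small counter along these lines may help. It is only a sketch of that
counting rule as described in the quoted text, not a decoder:
count_prefix_bytes is a made-up name, VEX/EVEX encodings are not handled,
and the input is assumed to be one well-formed legacy-encoded instruction.

```c
#include <stddef.h>
#include <stdint.h>

/* Legacy prefixes: operand-size, address-size, LOCK, REP/REPNE,
   and the six segment overrides. */
static int is_legacy_prefix(uint8_t b)
{
    switch (b) {
    case 0x66: case 0x67: case 0xF0: case 0xF2: case 0xF3:
    case 0x26: case 0x2E: case 0x36: case 0x3E: case 0x64: case 0x65:
        return 1;
    default:
        return 0;
    }
}

/* Count legacy prefixes, a REX prefix, and the 0F / 0F 38 / 0F 3A
   escape bytes at the start of one instruction (hypothetical helper). */
static size_t count_prefix_bytes(const uint8_t *insn, size_t len)
{
    size_t i = 0, n = 0;

    while (i < len && is_legacy_prefix(insn[i])) { i++; n++; }

    /* REX is 0x40..0x4F in 64-bit mode and must come last before the opcode. */
    if (i < len && (insn[i] & 0xF0) == 0x40) { i++; n++; }

    if (i < len && insn[i] == 0x0F) {
        i++; n++;
        if (i < len && (insn[i] == 0x38 || insn[i] == 0x3A))
            n++;   /* second escape byte of a 3-byte opcode */
    }
    return n;
}
```

Fed the movbe encoding from the quoted text, 67 66 0f 38 f1 07, it returns
4; for a plain movl (%edi), %eax, encoded 67 8b 07, it returns 1.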
