https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85038
            Bug ID: 85038
           Summary: x32: unnecessary address-size prefix when a pointer
                    register is already zero-extended
           Product: gcc
           Version: 8.0.1
            Status: UNCONFIRMED
          Keywords: missed-optimization
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: peter at cordes dot ca
  Target Milestone: ---

Bug 82267 was fixed for RSP only. (Or interpreted narrowly as only being about RSP vs. ESP.)

This bug is about the general case of using address-size prefixes in cases where we could prove they're not needed. Either because out-of-bounds is UB so we don't care about wrap vs. going outside 4GiB, or (simpler) the single-register case when we know the pointer is already zero-extended. Maybe we want separate bugs to track parts of this that can be fixed with separate patches, but I won't consider this fixed until -mx32 emits optimal code for all the cases listed here. I realize this won't be any time soon, but it's still code-size (and thus indirectly performance) that gcc is leaving on the table.

Being smarter about using 64-bit address-size is even more useful for AArch64 -mabi=ilp32, because it doesn't have 32-bit address-size overrides, so it always costs an extra instruction every time we fail to prove that 64-bit is safe. (And AArch64 ILP32 may get more use than x32 these days.) I intended this bug to be about x32, though.

--------

Useless 0x67 address-size override prefixes hurt code-size and thus performance on everything, with more serious problems on some CPUs that have trouble with more than 3 prefixes (especially Silvermont). See Bug 82267 for the details, which I won't repeat.

We still have tons of useless 0x67 prefixes in the default -maddress-mode=short mode (for every memory operand other than RSP, or RIP-relative), and -maddress-mode=long has lots of missed optimizations resulting in wasted LEA instructions, so neither one is good.
float doublederef(float **p){
    return **p;
}
 // https://godbolt.org/g/exb74t
 // gcc 8.0.1 (trunk) -O3 -mx32 -march=haswell -maddress-mode=short
        movl    (%edi), %eax
        vmovss  (%eax), %xmm0      # could/should be (%rax)
        ret

-maddress-mode=long gets that right, using (%rax), and also (%rdi) because the ABI doc specifies that x32 passes pointers zero-extended. mode=short still ensures that, so failure to take advantage is still a missed-opt.

Note that clang -mx32 violates that ABI guarantee by compiling

    pass_arg(unsigned long long ptr) { ext_func((void*)ptr); }

to just a tailcall (while gcc does zero-extend). See output in the godbolt link above. IDK if we care about being bug-compatible with clang for that corner case for this rare ABI, though. A less contrived case would be a struct arg or return value packed into a register and passed on as just a pointer.

-----

// arr+offset*4 is strictly within the low 32 bits because of range limits

float safe_offset(float *arr, unsigned offset){
    unsigned tmp = (unsigned)arr;
    arr = (void*)(tmp & -4096);  // round down to a page
    offset &= 0xf;
    return arr[offset];
}
   // on the above godbolt link
    #mode=short
        andl    $-4096, %edi
        andl    $15, %esi
        vmovss  (%edi,%esi,4), %xmm0
      # (%rdi,%rsi,4) would have been safe, but that's maybe not worth looking for.
      # most cases have less pointer alignment than offset range

    #mode=long
        andl    $-4096, %edi
        andl    $15, %esi
        leal    (%rdi,%rsi,4), %eax
        vmovss  (%eax), %xmm0      # 32-bit addrmode after using a separate LEA

So mode=long is just braindead here. It gets the worst of both worlds, using a separate LEA but then not taking advantage of the zero-extended pointer. The only way this could be worse is if the LEA operand-size were 64-bit.

Without the masking, both modes just use vmovss (%edi,%esi,4), %xmm0, but the extra operations defeat mode=long's attempts to recognize this case, and it picks an LEA instead of (or as well as?!?) an address-size prefix.
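The "strictly within the low 32 bits" claim for safe_offset can be checked with a small model of the addressing mode. This is only a sketch: ea64, ea32, max_masked_address and check_masked_range are hypothetical helper names, not anything in gcc, and the model covers just base + index*4 with no displacement.

```c
#include <assert.h>
#include <stdint.h>

/* base + index*4 with 64-bit address size: no truncation of the sum. */
static uint64_t ea64(uint64_t base, uint64_t index) {
    return base + index * 4;
}

/* Same operands with a 0x67 prefix: the sum is truncated to 32 bits. */
static uint64_t ea32(uint64_t base, uint64_t index) {
    return (uint32_t)(base + index * 4);
}

/* Worst case after safe_offset()'s masks: highest page, highest index. */
static uint64_t max_masked_address(void) {
    uint32_t max_base  = 0xFFFFFFFFu & (uint32_t)-4096;  /* 0xFFFFF000 */
    uint32_t max_index = 0xFFFFFFFFu & 0xFu;             /* 15 */
    return ea64(max_base, max_index);
}

/* Check every index for a few page-aligned bases, including the top page. */
static void check_masked_range(void) {
    const uint32_t bases[] = { 0x00000000u, 0x00001000u, 0x7FFFF000u, 0xFFFFF000u };
    for (unsigned b = 0; b < sizeof bases / sizeof bases[0]; b++)
        for (uint32_t i = 0; i <= 0xF; i++)
            assert(ea64(bases[b], i) == ea32(bases[b], i));
}
```

The worst case is 0xFFFFF000 + 15*4 = 0xFFFFF03C, which has no carry out of bit 31, so in this model the prefixed and unprefixed forms always agree for the masked inputs.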
-------

With a 64-bit offset, and a pointer that's definitely zero-extended to 64 bits:

// same for signed or unsigned
float ptr_and_offset_zext(float **p, unsigned long long offset){
    float *arr = *p;
    return arr[offset];
}
    # mode=short
        movl    (%edi), %eax              # mode=long uses (%rdi) here
        vmovss  (%eax,%esi,4), %xmm0      # but still 32-bit here.
        ret

Why are we using address-size prefixes to stop a base+index from going outside 4G on out-of-bounds UB? (%rax,%rsi,4) should work for a signed / unsigned 64-bit offset when the pointer is known to be zero-extended.

ISO C11 says that pointer+integer produces a result of pointer type, with UB if the result goes outside the array. It does *not* say that the integer has to be truncated to pointer width *first*.

> n1570 6.5.6 Additive operators, point 8:
> ...
> If both the pointer operand and the result point to elements of the same
> array object, or one past the last element of the array object, the
> evaluation shall not produce an overflow;
> **otherwise, the behavior is undefined.**

So it's perfectly valid to use all the bits of a wide integer array index, because it is UB if the high bits take the result outside of any object, even if truncating the input or the result to 32 bits would have produced a valid pointer.

This allows optimizations with 64-bit offsets, and with 32-bit or narrower offsets that are correctly extended to 64 bits. (But I suspect that gcc internals make it hard to take advantage, if we don't have a concept of "correctly extended to 64 bits in a register" before we lose the signed vs. unsigned info.)

Richard Biener pointed out (https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82267#c1) that although wrap-around of pointer-math is not required, RTL doesn't know whether 32-bit offsets are signed or unsigned and thus has to consider the case of a signed 32-bit offset (where 64-bit address size with 32-bit signed values zero-extended in 64-bit registers wouldn't work).
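To make the signed-offset concern concrete, here is a rough C model of the two address sizes applied to an offset of -1. The helper names and the base address 0x10000000 are made up for illustration, and the model ignores displacements and segment bases.

```c
#include <assert.h>
#include <stdint.h>

/* base + index*4 with 64-bit address size. */
static uint64_t ea64(uint64_t base, uint64_t index) {
    return base + index * 4;
}

/* base + index*4 with 32-bit address size (0x67 prefix). */
static uint64_t ea32(uint64_t base, uint64_t index) {
    return (uint32_t)(base + index * 4);
}

/* A signed 32-bit offset left zero-extended in a 64-bit register,
   which is what any 32-bit operand-size write produces on x86-64. */
static uint64_t zero_extended(int32_t off) {
    return (uint32_t)off;
}

/* The same offset correctly sign-extended to 64 bits (e.g. by movslq). */
static uint64_t sign_extended(int32_t off) {
    return (uint64_t)(int64_t)off;
}

/* arr[-1] from a hypothetical zero-extended pointer value. */
static void check_negative_offset(void) {
    const uint64_t base = 0x10000000u;   /* assumed pointer value */
    const uint64_t want = base - 4;      /* address of arr[-1] */

    assert(ea32(base, zero_extended(-1)) == want);  /* prefix hides the problem */
    assert(ea64(base, zero_extended(-1)) != want);  /* lands above 4GiB */
    assert(ea64(base, sign_extended(-1)) == want);  /* proper extension is fine */
}
```

In this model the 64-bit form only goes wrong when the index register holds a zero-extended copy of a negative 32-bit value. Once the index is a genuine 64-bit value, or has been sign-extended, both address sizes give the same in-bounds address.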
But here the pointer and offset are already *64* bits (with mode=long) or correctly extended to 64 (with mode=short), so (%eax,%esi,4) is the same effective-address as (%rax,%rsi,4) for any offset that doesn't cause UB by going outside an object.

i.e. we can assume that array + offset fits in 32 bits, even if offset is a 64-bit negative integer (if array is a pointer zero-extended to 64-bit). If the resulting 64-bit address is outside the low 32, it was UB because we know there are no objects there, so we don't have to care about such inputs.

Thus it doesn't matter whether we use 64-bit address size and let the addressing mode generate a valid address with the upper 32 bits zero, or whether we truncate the address calculation to 32 bits. The only difference will be for out-of-bounds offset values, which could wrap back to a valid address on truncation to 32 bits, instead of faulting on an attempt to access far beyond the end of an array.

----

Estimate of the code-size impact: maybe 4% machine-code size for pointer-heavy code like gcc's own cc1 executable.

Looking at a binary compiled with -m64, how much worse would it be with -mx32? (Arch Linux doesn't support x32, so I don't have any binaries sitting around.)

    objdump -drwC -Mintel --section=.text /usr/lib/gcc/x86_64-pc-linux-gnu/7.3.0/cc1 | egrep -v '^[^ ]|^$|\Wnop\W|\Wlea\W' | egrep ' .*(\[r[^is]|\[rsi)' -c
    545101

That's 545101 instructions with a memory operand that gcc -mx32 would use a prefix for, and thus 545101 bytes of addr32 prefixes (out of ~3380113 total instructions in the .text section).

The first grep filters out non-instruction lines, and NOP / LEA. The 2nd grep counts matches for register addressing modes other than [rip+... and [rsp+..., which current gcc knows not to use an address-size prefix for.
On the over-optimistic assumption that every address-size prefix could be avoided, this 23MiB compiler executable (from gcc7.3 on Arch Linux) would have ~0.5MiB of address-size prefixes, or ~2% of the total size of the executable (which I think is mostly code, not data). Or 4% of the .text section (13695266 bytes).

This doesn't account for x32 code getting smaller from needing fewer REX prefixes, which is an error in the opposite direction from assuming that every addr32 can be avoided.

----

There are two approaches to improve the situation:

* teach -maddress-mode=long to use 32-bit addressing modes instead of extra instructions whenever it can't prove that a 64-bit address-size is safe, and make it the default mode.

* teach -maddress-mode=short to look for cases where it *can* prove that 64-bit address-size is safe, and omit the 0x67 prefix in that case.

Is either of these feasible? Is gcc just completely not designed for ILP32 ABIs on 64-bit CPUs?

One possible peephole is RBP+constant addressing modes when RBP is a frame pointer. (Related: mode=long uses mov %rsp, %rbp instead of mov %esp, %ebp.) Is info on whether RBP is a frame pointer or not available at the same point the RSP peephole check is done? If not, then the implementation would have to be very different.

But really -mx32 should at least avoid prefixes on single-register addressing modes. That should be easy to prove correct without reasoning about negative integers. It's *very* common that a pointer in a register is already zero-extended.
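The single-register case can be modelled in a few lines of C. This is an illustrative sketch only: write32, ea64_reg, ea32_reg and check_single_register are hypothetical names, and the "register" is just a uint64_t.

```c
#include <assert.h>
#include <stdint.h>

/* Writing a 32-bit result to a 64-bit register on x86-64 always clears
   the upper 32 bits, e.g. the destination of movl (%edi), %eax. */
static uint64_t write32(uint32_t value) {
    return (uint64_t)value;
}

/* Single-register addressing mode with 64-bit address size: (%rax). */
static uint64_t ea64_reg(uint64_t reg) {
    return reg;
}

/* Same register with the 0x67 prefix: (%eax). */
static uint64_t ea32_reg(uint64_t reg) {
    return (uint32_t)reg;
}

/* For any pointer produced by a 32-bit write, the prefix changes nothing. */
static void check_single_register(void) {
    const uint32_t ptrs[] = { 0x00000000u, 0x00001234u, 0x7FFFFFFCu, 0xFFFFFFFCu };
    for (unsigned i = 0; i < sizeof ptrs / sizeof ptrs[0]; i++) {
        uint64_t reg = write32(ptrs[i]);
        assert(ea64_reg(reg) == ea32_reg(reg));
    }
}
```

The two forms only differ for a register whose upper half is not known to be zero, e.g. ea64_reg(0x100001234) vs. ea32_reg(0x100001234), which is the case the compiler would have to rule out before dropping the prefix.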