On 12/12/2021 12:57 PM, Jan Hubicka wrote:

On 12/12/2021 3:49 AM, Jan Hubicka via Gcc-patches wrote:
Hi,
As discussed in the PR, we miss some optimizations because
gimple-ssa-isolate-paths turns NULL memory accesses into volatile accesses and
adds __builtin_trap after them.  The volatile access is seen as a side-effect
by IPA analysis, and additionally the (fully unreachable) __builtin_trap is
believed to load all global memory.

I think we should come up with a less intrusive gimple representation of this,
but it is also easy enough to special-case it in the IPA analysers, as done in
this patch.  This is a win even if we improve the representation, since
gimple-ssa-isolate-paths runs late and this way we improve optimization
early.
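
To make this concrete, here is a minimal example of my own (not taken from
the PR) showing the kind of path the pass isolates:

  /* Minimal sketch, not from the PR.  When flag is false, a literal NULL
     flows into the dereference through a PHI node; gimple-ssa-isolate-paths
     splits that path off, keeps the access as a volatile load from address 0
     (so it still faults) and places __builtin_trap () right after it.  */
  int
  f (int *p, int flag)
  {
    int *q = flag ? p : (int *) 0;
    return *q;
  }
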
So what's important about the IR representation is that we keep some kind of
memory access around (so that it will fault), that the memory access references
a minimal amount of other stuff (we don't want the address computation to keep
anything live, for example), and that we have a subsequent trap so that the CFG
routines know it traps.

Otherwise we're free to do whatever we want.
I was thinking about __builtin_trap_load (ptr) and __builtin_trap_store (ptr),
which we could annotate correctly for PTA and which would avoid the need for
two statements, but I am not sure if this is a good idea.
Seems reasonable.  I'm not sure if we need both though. __builtin_trap_memref(ptr) perhaps?
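
For concreteness, both forms below are only sketches of mine: the C-level
rendering of the isolated path is approximate, and the single-statement
builtin is hypothetical (it does not exist in GCC):

  /* Roughly what the isolated path amounts to today, written at the C
     level; in gimple it is a volatile load plus a separate call:  */
  *(volatile int *) 0;              /* access kept so the path still faults */
  __builtin_trap ();

  /* With a hypothetical single-statement builtin, the same path would be
     one call that PTA and the IPA analysers could annotate precisely:  */
  __builtin_trap_memref ((int *) 0);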



Coincidentally, I just ran across another artifact of this.  One of the
hottest functions in clang is getTerminator.  Clang builds it as:

        │    000000000122f4b0 <llvm::BasicBlock::getTerminator() const>:
        │    llvm::BasicBlock::getTerminator() const:
   2.16 │      mov    0x28(%rdi),%rax
  28.85 │      add    $0x28,%rdi
   0.20 │      cmp    %rax,%rdi
   3.55 │    ↓ je     29
        │      lea    -0x18(%rax),%rcx
   0.79 │      test   %rax,%rax
        │      cmove  %rax,%rcx
   4.15 │      movzbl 0x10(%rcx),%edx
  48.46 │      add    $0xffffffe5,%edx
   3.95 │      xor    %eax,%eax
   0.20 │      cmp    $0xb,%edx
   3.35 │      cmovb  %rcx,%rax
   4.33 │    ← ret
        │29:   xor    %eax,%eax
        │    ← ret

While we do:

        │
        │    0000000001471840 <llvm::BasicBlock::getTerminator() const>:
        │    llvm::BasicBlock::getTerminator() const:
   3.24 │      mov    0x28(%rdi),%rax
  31.31 │      add    $0x28,%rdi
   0.59 │      cmp    %rdi,%rax
   5.31 │    ↓ je     30
        │      test   %rax,%rax
   2.36 │    → je     9366f4 <llvm::BasicBlock::getTerminator() const [clone .cold]>
        │      movzbl -0x8(%rax),%edx
  45.73 │      sub    $0x18,%rax
        │      sub    $0x1b,%edx
   4.70 │      cmp    $0xb,%edx
   3.53 │      mov    $0x0,%edx
        │      cmovae %rdx,%rax
   3.23 │    ← ret
        │      xchg   %ax,%ax
        │30:   xor    %eax,%eax
        │    ← ret

....

        │
        │   00000000009366f4 <llvm::BasicBlock::getTerminator() const [clone .cold]>:
        │   llvm::BasicBlock::getTerminator() const [clone .cold]:
        │     movzbl 0x10,%eax
        │     ud2

The clang-generated code obviously does a conditional load through a NULL
pointer, but replacing this cmov by a conditional jump to the cold section
seems overkill.  The load from 0x10 into %eax seems useless...
The load into %eax definitely seems useless.  Given branch prediction, I'm less sure about the cmov vs. the jump to the cold section, but I don't have strong opinions.  If you think the cmov is a better choice, I won't argue.

jeff

