Re: using scratchpads to enhance RTL-level if-conversion: revised patch

Bernd Schmidt Wed, 14 Oct 2015 12:16:46 -0700

On 10/14/2015 07:43 PM, Jeff Law wrote:

> Obviously some pessimization relative to current code is necessary to
> fix some of the problems WRT thread safety and avoiding things like
> introducing faults in code which did not previously fault.

Huh? This patch is purely an (attempt at) optimization, not something that fixes any problems.

> However, pessimization of safe code is, err, um, bad and needs to be
> avoided.


Here's an example:

                                  >         subq    $16, %rsp
[...]
                                  >         leaq    8(%rsp), %r8
                                  >         leaq    256(%rax), %rdx
    cmpq    256(%rax), %rcx       |         cmpq    256(%rax), %rsi
    jne    .L97                   <
    movq    $0, 256(%rax)         <
.L97:                             <
                                  >         movq    %rdx, %rax
                                  >         cmovne  %r8, %rax
                                  >         movq    $0, (%rax)
[...]
                                  >         addq    $16, %rsp

In the worst case that executes six more instructions, and always causes unnecessary stack frame bloat. This on x86 where AFAIK it's doubtful whether cmov is a win at all anyway. I think this shows the approach is just bad, even ignoring problems like that it could allocate multiple scratchpads when one would suffice, or allocate one and end up not using it because the transformation fails.
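
At the source level, the two columns of the listing correspond roughly to the C below. This is only a sketch: the struct layout, the function names, and the local `scratch` variable are assumptions made for illustration, not code from the patch, and a compiler is free to emit different instructions for either form.

```c
/* Hypothetical layout: a field at offset 256, matching 256(%rax) above. */
struct node {
    char pad[256];
    long field;
};

/* Left column: conditional store guarded by a branch. */
void clear_if_equal_branch(struct node *p, long v)
{
    if (p->field == v)
        p->field = 0;
}

/* Right column: the store always executes, but its address is selected
   (cmov-style) between the real target and a stack scratchpad. */
void clear_if_equal_scratch(struct node *p, long v)
{
    long scratch;  /* dummy target; this is what forces the stack frame */
    long *dst = (p->field == v) ? &p->field : &scratch;
    *dst = 0;
}
```

Both functions leave `p->field` in the same state; the second trades the branch for an unconditional store plus the address setup.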

I can't test valgrind right now, it fails to run on my machine, but I guess it could adapt to allow stores slightly below the stack (maybe warning once)? It seems like a bit of an edge case to worry about, but if supporting it is critical and it can't be changed to adapt to new optimizations, then I think we're probably better off entirely without this scratchpad transformation.

Alternatively I can think of a few other possible approaches which wouldn't require this kind of bloat:

 * add support for allocating space in the stack redzone. That could be
   interesting for the register allocator as well. Would help only
   x86_64, but that's a large fraction of gcc's userbase.
 * add support for opportunistically finding unused alignment padding
   in the existing stack frame. Less likely to work but would produce
   better results when it does.
 * on embedded targets we probably don't have to worry about valgrind,
   so do the optimal (sp - x) thing there
 * allocate a single global as the dummy target. Might be more
   expensive to load the address on some targets though.
 * at least find a way to express costs for this transformation.
   Difficult since you don't yet necessarily know if the function is
   going to have a stack frame. Hence, IMO this approach is flawed.
   (You'll still want cost estimates even when not allocating stuff in
   the normal stack frame, because generated code will still execute
   between two and four extra instructions).
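
As a rough illustration of the "single global as the dummy target" option from the list above, the same conditional store could be expressed as follows. The names `struct obj`, `if_convert_dummy` and `clear_if_equal_global` are hypothetical, and this is a source-level sketch rather than what the RTL pass would actually generate.

```c
/* Hypothetical layout, as in the listing: field at offset 256. */
struct obj {
    char pad[256];
    long field;
};

/* One write-only dummy shared by every converted store (assumed name). */
static long if_convert_dummy;

void clear_if_equal_global(struct obj *p, long v)
{
    /* No stack slot is needed; the cost moves to materializing the
       address of the global, which may be more expensive on some targets. */
    long *dst = (p->field == v) ? &p->field : &if_convert_dummy;
    *dst = 0;
}
```

One caveat with this sketch: in a multithreaded program, concurrent stores to a shared dummy are formally a data race in C11, so a real implementation would have to account for that.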


Bernd
