Re: [PATCH] New finish_compare_by_pieces target hook (for x86).

Richard Sandiford via Gcc-patches Mon, 26 Jun 2023 11:24:39 -0700

Richard Biener via Gcc-patches <gcc-patches@gcc.gnu.org> writes:
> On Sun, Jun 25, 2023 at 7:39 AM Roger Sayle <ro...@nextmovesoftware.com> 
> wrote:
>>
>>
>> On Tue, 13 June 2023 12:02, Richard Biener wrote:
>> > On Mon, Jun 12, 2023 at 4:04 PM Roger Sayle <ro...@nextmovesoftware.com>
>> > wrote:
>> > > The following simple test case, from PR 104610, shows that memcmp ()
>> > > == 0 can result in some bizarre code sequences on x86.
>> > >
>> > > int foo(char *a)
>> > > {
>> > >     static const char t[] = "0123456789012345678901234567890";
>> > >     return __builtin_memcmp(a, &t[0], sizeof(t)) == 0; }
>> > >
>> > > with -O2 currently contains both:
>> > >         xorl    %eax, %eax
>> > >         xorl    $1, %eax
>> > > and also
>> > >         movl    $1, %eax
>> > >         xorl    $1, %eax
>> > >
>> > > Changing the return type of foo to _Bool results in the equally
>> > > bizarre:
>> > >         xorl    %eax, %eax
>> > >         testl   %eax, %eax
>> > >         sete    %al
>> > > and also
>> > >         movl    $1, %eax
>> > >         testl   %eax, %eax
>> > >         sete    %al
>> > >
>> > > All these sequences set the result to a constant, but this
>> > > optimization opportunity only occurs very late during compilation, by
>> > > basic block duplication in the 322r.bbro pass, too late for CSE or
>> > > peephole2 to do anything about it.  The problem is that the idiom
>> > > expanded by compare_by_pieces for __builtin_memcmp_eq contains basic
>> > > blocks that can't easily be optimized by if-conversion due to the
>> > > multiple incoming edges on the fail block.
>> > >
>> > > In summary, compare_by_pieces generates code that looks like:
>> > >
>> > >         if (x[0] != y[0]) goto fail_label;
>> > >         if (x[1] != y[1]) goto fail_label;
>> > >         ...
>> > >         if (x[n] != y[n]) goto fail_label;
>> > >         result = 1;
>> > >         goto end_label;
>> > > fail_label:
>> > >         result = 0;
>> > > end_label:
>> > >
>> > > In theory, the RTL if-conversion pass could be enhanced to tackle
>> > > arbitrarily complex if-then-else graphs, but the solution proposed
>> > > here is to allow suitable targets to perform if-conversion during
>> > > compare_by_pieces.  The x86, for example, can take advantage that all
>> > > of the above comparisons set and test the zero flag (ZF), which can
>> > > then be used in combination with sete.  Hence compare_by_pieces could
>> > > instead generate:
>> > >
>> > >         if (x[0] != y[0]) goto fail_label;
>> > >         if (x[1] != y[1]) goto fail_label;
>> > >         ...
>> > >         if (x[n] != y[n]) goto fail_label;
>> > > fail_label:
>> > >         sete result
>> > >
>> > > which requires one less basic block, and the redundant conditional
>> > > branch to a label immediately after is cleaned up by GCC's existing
>> > > RTL optimizations.
>> > >
>> > > For the test case above, where -O2 -msse4 previously generated:
>> > >
>> > > foo:    movdqu  (%rdi), %xmm0
>> > >         pxor    .LC0(%rip), %xmm0
>> > >         ptest   %xmm0, %xmm0
>> > >         je      .L5
>> > > .L2:    movl    $1, %eax
>> > >         xorl    $1, %eax
>> > >         ret
>> > > .L5:    movdqu  16(%rdi), %xmm0
>> > >         pxor    .LC1(%rip), %xmm0
>> > >         ptest   %xmm0, %xmm0
>> > >         jne     .L2
>> > >         xorl    %eax, %eax
>> > >         xorl    $1, %eax
>> > >         ret
>> > >
>> > > we now generate:
>> > >
>> > > foo:    movdqu  (%rdi), %xmm0
>> > >         pxor    .LC0(%rip), %xmm0
>> > >         ptest   %xmm0, %xmm0
>> > >         jne     .L2
>> > >         movdqu  16(%rdi), %xmm0
>> > >         pxor    .LC1(%rip), %xmm0
>> > >         ptest   %xmm0, %xmm0
>> > > .L2:    sete    %al
>> > >         movzbl  %al, %eax
>> > >         ret
>> > >
>> > > Using a target hook allows the large amount of intelligence already in
>> > > compare_by_pieces to be re-used by the i386 backend, but this can also
>> > > help other backends with condition flags where the equality result can
>> > > be materialized.
>> > >
>> > > This patch has been tested on x86_64-pc-linux-gnu with make bootstrap
>> > > and make -k check, both with and without --target_board=unix{-m32}
>> > > with no new failures.  Ok for mainline?
>> >
>> > What's the guarantee that the zero flag is appropriately set on all edges 
>> > incoming
>> > now and forever?
>>
>> Is there any reason why this target hook can't be removed (in future) should 
>> it stop
>> being useful?  It's completely optional and not required for the correct 
>> functioning
>> of the compiler.
>>
>> > Does this require target specific knowledge on how do_compare_rtx_and_jump
>> > is emitting RTL?
>>
>> Yes.  Each backend can decide how best to implement finish_compare_by_pieces
>> given its internal knowledge of how do_compare_rtx_and_jump works.  It's not
>> important to the middle-end how the underlying invariants are guaranteed, 
>> just
>> that they are and the backend produces correct code.  A backend may store 
>> flags
>> on the target label, or maintain state in cfun.  Future changes to the i386 
>> backend
>> might cause it to revert to the default finish_compare_by_pieces, or provide 
>> an
>> alternate implementation, but at the moment this patch improves the code that
>> GCC generates.  Very little (in software like GCC) is forever.
>>
>> > Do you see matching this in ifcvt to be unreasonable?  I'm thinking of 
>> > "reducing"
>> > the incoming edges pairwise without actually looking at the ifcvt code.
>>
>> There's nothing about the proposed patch that prevents or blocks improvements
>> to ifcvt, which I agree completely could be improved.  But even (in future) 
>> if later
>> RTL passes could clean things up, that's no reason for RTL expansion to 
>> initially
>> generate poor/inefficient code.  I'm not sure that a (hypothetical) ifcvt 
>> improvement
>> would be sufficient reason to revert/remove enhancements to 
>> compare_by_pieces.
>>
>> Is it that there's not enough (bootstrap and) testsuite coverage of 
>> compare_by_pieces
>> to make you feel comfortable with this change?  The proposed implementation 
>> should
>> be "obvious enough" to future generations what the intended behaviour should 
>> be.
>> And the x86 target hook implementation (i.e. the change) is only four lines 
>> long, a
>> fraction of the size of the new documentation and comments.
>
> My main concern was that we are communicating "implicit" dependences between
> the target hook and expand RTL code generation - we don't seem to pass down
> enough info for example to have the target verify constraints and
> excuse itself if
> they do not hold.  They also seem to be poorly documented and the
> compare_by_pieces (and all _by_pieces) stuff has been extended to be quite
> "configurable" and thus is probably prone to emit vastly different RTL in
> some special cases?


FWIW, I share these concerns, especially about the target hook not
being able to verify its assumptions.  Maybe it wouldn't matter too
much if the worst outcome was a missed optimisation, but I don't think
the mechanism is reliable enough for us to rely on it for correctness.

Richard

Re: [PATCH] New finish_compare_by_pieces target hook (for x86).

Reply via email to