Richard Biener via Gcc-patches <gcc-patches@gcc.gnu.org> writes: > On Sun, Jun 25, 2023 at 7:39 AM Roger Sayle <ro...@nextmovesoftware.com> > wrote: >> >> >> On Tue, 13 June 2023 12:02, Richard Biener wrote: >> > On Mon, Jun 12, 2023 at 4:04 PM Roger Sayle <ro...@nextmovesoftware.com> >> > wrote: >> > > The following simple test case, from PR 104610, shows that memcmp () >> > > == 0 can result in some bizarre code sequences on x86. >> > > >> > > int foo(char *a) >> > > { >> > > static const char t[] = "0123456789012345678901234567890"; >> > > return __builtin_memcmp(a, &t[0], sizeof(t)) == 0; } >> > > >> > > with -O2 currently contains both: >> > > xorl %eax, %eax >> > > xorl $1, %eax >> > > and also >> > > movl $1, %eax >> > > xorl $1, %eax >> > > >> > > Changing the return type of foo to _Bool results in the equally >> > > bizarre: >> > > xorl %eax, %eax >> > > testl %eax, %eax >> > > sete %al >> > > and also >> > > movl $1, %eax >> > > testl %eax, %eax >> > > sete %al >> > > >> > > All these sequences set the result to a constant, but this >> > > optimization opportunity only occurs very late during compilation, by >> > > basic block duplication in the 322r.bbro pass, too late for CSE or >> > > peephole2 to do anything about it. The problem is that the idiom >> > > expanded by compare_by_pieces for __builtin_memcmp_eq contains basic >> > > blocks that can't easily be optimized by if-conversion due to the >> > > multiple incoming edges on the fail block. >> > > >> > > In summary, compare_by_pieces generates code that looks like: >> > > >> > > if (x[0] != y[0]) goto fail_label; >> > > if (x[1] != y[1]) goto fail_label; >> > > ... >> > > if (x[n] != y[n]) goto fail_label; >> > > result = 1; >> > > goto end_label; >> > > fail_label: >> > > result = 0; >> > > end_label: >> > > >> > > In theory, the RTL if-conversion pass could be enhanced to tackle >> > > arbitrarily complex if-then-else graphs, but the solution proposed >> > > here is to allow suitable targets to perform if-conversion during >> > > compare_by_pieces. The x86, for example, can take advantage that all >> > > of the above comparisons set and test the zero flag (ZF), which can >> > > then be used in combination with sete. Hence compare_by_pieces could >> > > instead generate: >> > > >> > > if (x[0] != y[0]) goto fail_label; >> > > if (x[1] != y[1]) goto fail_label; >> > > ... >> > > if (x[n] != y[n]) goto fail_label; >> > > fail_label: >> > > sete result >> > > >> > > which requires one less basic block, and the redundant conditional >> > > branch to a label immediately after is cleaned up by GCC's existing >> > > RTL optimizations. >> > > >> > > For the test case above, where -O2 -msse4 previously generated: >> > > >> > > foo: movdqu (%rdi), %xmm0 >> > > pxor .LC0(%rip), %xmm0 >> > > ptest %xmm0, %xmm0 >> > > je .L5 >> > > .L2: movl $1, %eax >> > > xorl $1, %eax >> > > ret >> > > .L5: movdqu 16(%rdi), %xmm0 >> > > pxor .LC1(%rip), %xmm0 >> > > ptest %xmm0, %xmm0 >> > > jne .L2 >> > > xorl %eax, %eax >> > > xorl $1, %eax >> > > ret >> > > >> > > we now generate: >> > > >> > > foo: movdqu (%rdi), %xmm0 >> > > pxor .LC0(%rip), %xmm0 >> > > ptest %xmm0, %xmm0 >> > > jne .L2 >> > > movdqu 16(%rdi), %xmm0 >> > > pxor .LC1(%rip), %xmm0 >> > > ptest %xmm0, %xmm0 >> > > .L2: sete %al >> > > movzbl %al, %eax >> > > ret >> > > >> > > Using a target hook allows the large amount of intelligence already in >> > > compare_by_pieces to be re-used by the i386 backend, but this can also >> > > help other backends with condition flags where the equality result can >> > > be materialized. >> > > >> > > This patch has been tested on x86_64-pc-linux-gnu with make bootstrap >> > > and make -k check, both with and without --target_board=unix{-m32} >> > > with no new failures. Ok for mainline? >> > >> > What's the guarantee that the zero flag is appropriately set on all edges >> > incoming >> > now and forever? >> >> Is there any reason why this target hook can't be removed (in future) should >> it stop >> being useful? It's completely optional and not required for the correct >> functioning >> of the compiler. >> >> > Does this require target specific knowledge on how do_compare_rtx_and_jump >> > is emitting RTL? >> >> Yes. Each backend can decide how best to implement finish_compare_by_pieces >> given its internal knowledge of how do_compare_rtx_and_jump works. It's not >> important to the middle-end how the underlying invariants are guaranteed, >> just >> that they are and the backend produces correct code. A backend may store >> flags >> on the target label, or maintain state in cfun. Future changes to the i386 >> backend >> might cause it to revert to the default finish_compare_by_pieces, or provide >> an >> alternate implementation, but at the moment this patch improves the code that >> GCC generates. Very little (in software like GCC) is forever. >> >> > Do you see matching this in ifcvt to be unreasonable? I'm thinking of >> > "reducing" >> > the incoming edges pairwise without actually looking at the ifcvt code. >> >> There's nothing about the proposed patch that prevents or blocks improvements >> to ifcvt, which I agree completely could be improved. But even (in future) >> if later >> RTL passes could clean things up, that's no reason for RTL expansion to >> initially >> generate poor/inefficient code. I'm not sure that a (hypothetical) ifcvt >> improvement >> would be sufficient reason to revert/remove enhancements to >> compare_by_pieces. >> >> Is it that there's not enough (bootstrap and) testsuite coverage of >> compare_by_pieces >> to make you feel comfortable with this change? The proposed implementation >> should >> be "obvious enough" to future generations what the intended behaviour should >> be. >> And the x86 target hook implementation (i.e. the change) is only four lines >> long, a >> fraction of the size of the new documentation and comments. > > My main concern was that we are communicating "implicit" dependences between > the target hook and expand RTL code generation - we don't seem to pass down > enough info for example to have the target verify constraints and > excuse itself if > they do not hold. They also seem to be poorly documented and the > compare_by_pieces (and all _by_pieces) stuff has been extended to be quite > "configurable" and thus is probably prone to emit vastly different RTL in > some special cases?
FWIW, I share these concerns, especially about the target hook not being able to verify its assumptions. Maybe it wouldn't matter too much if the worst outcome was a missed optimisation, but I don't think the mechanism is reliable enough for us to rely on it for correctness. Richard