This WIP patchlet introduces a means, for machines that implicitly
clobber the cc flags in asm statements but that have machinery to output
flags (namely x86, non-Thumb ARM and AArch64), to state that an asm
statement does NOT clobber the cc flags.  That's accomplished by using
"=@ccC" in the output constraints.  It disables the implicit clobber,
but it doesn't set up an actual asm output for the flags, so they are
left alone.
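With the patchlet, that would be spelled along these lines (hypothetical,
since "=@ccC" is precisely the syntax being proposed here, and `dummy' is
a made-up lvalue: per the patch it gets replaced with a scratch, so it is
never actually written to):

  int dummy;
  /* "+m" forces the output to memory; "=@ccC" suppresses the
     implicit flags clobber without binding any flag output.  */
  asm ("" : "+m" (*p), "=@ccC" (dummy));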

It's ugly, I know.  I've considered "!cc" or "nocc" in the clobber
list as a machine-independent way to signal that cc is not modified, or
even special-casing empty asm patterns (at a slight risk of breaking
code that expects implicit clobbers even for empty asm patterns; but an
empty asm pattern can't actually clobber the flags, so how could it
break anything?).  I take it this might be useful for do-nothing asm
statements, often used to stop certain optimizations, e.g.:

  __typeof (*p) c = __builtin_add_overflow (*p, 1, p);
  asm ("" : "+m" (*p)); // Make sure we write to memory.
  *p += c; // This should compile into an add with carry.

Is there interest in conveying a no-cc-clobbering asm, and if so, is
there a preferred (portable?) form for it?


Without the asm, we issue load;add;adc;store, which is not the ideal
sequence with add and adc to the same memory address (or two different
addresses, if the last statement uses say *q instead of *p).

With the asm clobbering the flags, we end up saving the wanted flag to a
register, and then we use a plain add for the final statement.  Its
load/plus/store gimple sequence is successfully optimized to a memory
add with TER.  With arrangements to stop the asm from clobbering the
flags, we get an adc to memory instead, and don't waste cycles saving
the flag to a register.
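On x86 the three scenarios above come out roughly along these lines
(illustrative AT&T syntax; the registers and the exact flag-saving
sequence are made up, only the shape matters):

  # without the asm: load;add;adc;store
  movl    (%rdi), %eax
  addl    $1, %eax
  adcl    $0, %eax
  movl    %eax, (%rdi)

  # asm clobbers flags: save the carry, plain add at the end
  addl    $1, (%rdi)
  setc    %al
  movzbl  %al, %eax
  addl    %eax, (%rdi)

  # clobber disabled: the sequence we'd like
  addl    $1, (%rdi)
  adcl    $0, (%rdi)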

Alas, getting the first add to go straight to memory is more
complicated.  Even with the asm that forces the output to memory, the
output flag makes it harder to get it optimized to an add-to-memory
form.  When the output flag is unused, we optimize it enough in gimple
that TER does its job and we issue a single add, but that's not possible
when the two outputs of ADD_OVERFLOW are used: the flag setting gets
optimized away, but only after stopping combine from turning the
load/add/store into an add-to-memory.

If we retried the 3-insn substitution after substituting the flag store
into the add for the adc, we might succeed, but only if we had a pattern
that matched add<mode>3_cc_overflow_1's parallel with the flag-setter as
the second element of the parallel, because that's where combine adds it
to the new i3 pattern, after splitting it out of i2.

I suppose adding such patterns manually isn't the way to go.  I wonder
if getting recog_for_combine to recognize and reorder PARALLELs
appearing out of order would get too expensive, even if genrecog were to
generate optimized code to try alternate orders in parallels.

The issue doesn't seem that important in the grand scheme of things, but
there is some embarrassment from the missed combines, and from the
(AFAICT) impossibility of getting GCC to issue the most compact (and
possibly fastest) insn sequence on x86* for a 'memory += value;' spelled
as __builtin_add_overflow, when the result of the overflow check is
used.

Part of the issue is that the *_overflow builtins, despite taking a
pointer to hold the op result, split the pointer away from the IFN
interface.  That enables some optimizations, but it ends up preventing
the most desirable (?) translation when adding to a variable in
memory.
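To illustrate the split: IFN_ADD_OVERFLOW returns a complex value whose
real part is the sum and whose imaginary part is the overflow bit, and
the store through the pointer becomes a separate gimple statement, so
the gimple for the snippet looks roughly like (sketch, SSA names made
up):

  _1 = .ADD_OVERFLOW (_2, 1);
  _3 = REALPART_EXPR <_1>;
  *p = _3;
  c = IMAGPART_EXPR <_1>;

The add-to-memory form then has to be recovered by later passes rather
than being expressed directly.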

Thoughts?  Suggestions?  Any tricks to share to get the desired
sequence?  Thanks in advance,
---
 gcc/config/arm/aarch-common.c |    8 +++++++-
 gcc/config/i386/i386.c        |    6 ++++++
 2 files changed, 13 insertions(+), 1 deletion(-)

diff --git a/gcc/config/arm/aarch-common.c b/gcc/config/arm/aarch-common.c
index 6bc6ccf..ebe034c 100644
--- a/gcc/config/arm/aarch-common.c
+++ b/gcc/config/arm/aarch-common.c
@@ -557,11 +557,17 @@ arm_md_asm_adjust (vec<rtx> &outputs, vec<rtx> &/*inputs*/,
 #define C(X, Y)  (unsigned char)(X) * 256 + (unsigned char)(Y)
 
       /* All of the condition codes are two characters.  */
-      if (con[0] != 0 && con[1] != 0 && con[2] == 0)
+      if (con[0] != 0 && (con[1] == 0 || con[2] == 0))
        con01 = C(con[0], con[1]);
 
       switch (con01)
        {
+       case C('C', 0):
+         saw_asm_flag = true;
+         constraints[i] = "=X";
+         outputs[i] = gen_rtx_SCRATCH (SImode);
+         continue;
+
        case C('c', 'c'):
        case C('l', 'o'):
          mode = CC_Cmode, code = GEU;
diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c
index a15807d..56e086a 100644
--- a/gcc/config/i386/i386.c
+++ b/gcc/config/i386/i386.c
@@ -21021,6 +21021,12 @@ ix86_md_asm_adjust (vec<rtx> &outputs, vec<rtx> &/*inputs*/,
 
       switch (con[0])
        {
+       case 'C':
+         saw_asm_flag = true;
+         constraints[i] = "=X";
+         outputs[i] = gen_rtx_SCRATCH (SImode);
+         continue;
+
        case 'a':
          if (con[1] == 0)
            mode = CCAmode, code = EQ;


-- 
Alexandre Oliva, happy hacker
https://FSFLA.org/blogs/lxo/
Free Software Activist
GNU Toolchain Engineer
