This WIP patchlet introduces a means for machines that implicitly clobber the cc flags in asm statements, but that have machinery for asm flag outputs (namely x86, non-Thumb ARM and AArch64), to state that an asm statement does NOT clobber the cc flags. That's accomplished by using "=@ccC" in the output constraints: it disables the implicit clobber, but it doesn't set up an actual asm output bound to the flags, so they are left alone.
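For concreteness, usage would look something like this (a minimal sketch; I'm assuming the dummy output variable is still required syntactically, even though the patch discards it in favor of a scratch):

  unsigned int cc_unused;
  /* "+m" makes the asm read and write *p; "=@ccC" states that the
     asm leaves the flags alone, instead of implicitly clobbering
     them.  */
  asm ("" : "+m" (*p), "=@ccC" (cc_unused));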
It's ugly, I know. I've considered "!cc" or "nocc" in the clobber list as a machine-independent way to signal that cc is not modified, or even special-casing empty asm patterns (at a slight risk of breaking code that expects implicit clobbers even for empty asm patterns; but an empty asm pattern won't clobber the flags, so how could it break anything?).

I figure this might be useful for do-nothing asm statements, often used to stop certain optimizations, e.g.:

  __typeof (*p) c = __builtin_add_overflow (*p, 1, p);
  asm ("" : "+m" (*p)); // Make sure we write to memory.
  *p += c; // This should compile into an add with carry.

Is there interest in conveying a no-cc-clobbering asm, and if so, is there a preferred (portable?) form for it?

Without the asm, we issue load;add;adc;store, rather than the ideal sequence of an add and an adc to the same memory address (or to two different addresses, if the last statement uses say *q instead of *p). With the asm as above, whose flags are implicitly clobbered, we end up saving the flag to a register, and then use a plain add for the final statement; its load/plus/store gimple sequence is successfully optimized to a memory add by TER. With arrangements to stop the asm from clobbering the flags, we get an adc to memory, and don't waste cycles saving the wanted flag to a register.

Alas, getting the first add to go straight to memory is more complicated. Even with the asm that forces the output to memory, the flag output makes it harder to get the add optimized into an add-to-memory form. When the output flag is unused, we optimize enough in gimple that TER does its job and we issue a single add, but that's not possible when both outputs of ADD_OVERFLOW are used: the flag setting gets optimized away, but only after stopping combine from turning the load/add/store into an add-to-memory.

If we retried the 3-insn substitution after substituting the flag store into the add for the adc, we might succeed, but only if we had a pattern that matched add<mode>3_cc_overflow_1's parallel with the flag-setter as the second element of the parallel, because that's where combine adds it to the new i3 pattern, after splitting it out of i2. I suppose adding such patterns by hand isn't the way to go. I wonder whether getting recog_for_combine to recognize and reorder PARALLELs that appear out of order would get too expensive, even if genrecog were to generate optimized code to try alternate orders within parallels.

The issue doesn't seem that important in the grand scheme of things, but there is some embarrassment from the missed combines, and from the (AFAICT) impossibility of getting GCC to issue the most compact (and possibly fastest) insn sequence on x86* for a 'memory += value;' spelled as __builtin_add_overflow, when the result of the overflow check is used. Part of the issue is that the *_overflow builtins, despite taking a pointer to hold the op result, split the pointer away from the IFN interface; that enables some optimizations, but it ends up preventing the most desirable (?) translation when adding to a variable in memory.

Thoughts? Suggestions? Any tricks to share to get the desired sequence?
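To make the target concrete, here's an untested sketch that puts the pieces above together, using the "=@ccC" form from the patchlet; the insn mnemonics in the comments are hand-written approximations of the sequences discussed, not actual compiler output:

  /* Hypothetical example; assume p points to writable memory.  */
  void
  inc_in_memory (unsigned int *p)
  {
    /* Ideally a single addl $1, (%rdi); today the add happens in a
       register, because the flag output keeps TER and combine from
       forming an add-to-memory.  */
    unsigned int c = __builtin_add_overflow (*p, 1, p);

    unsigned int cc_unused;
    /* Force *p out to memory without the implicit flags clobber, so
       that the carry from the add above survives the asm.  */
    asm ("" : "+m" (*p), "=@ccC" (cc_unused));

    /* With the flags preserved, this compiles to an adc to memory,
       adcl $0, (%rdi); with a flags-clobbering asm, the carry would
       instead have to be saved to a register and added back with a
       plain add.  */
    *p += c;
  }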
Thanks in advance,

---
 gcc/config/arm/aarch-common.c | 8 +++++++-
 gcc/config/i386/i386.c        | 6 ++++++
 2 files changed, 13 insertions(+), 1 deletion(-)

diff --git a/gcc/config/arm/aarch-common.c b/gcc/config/arm/aarch-common.c
index 6bc6ccf..ebe034c 100644
--- a/gcc/config/arm/aarch-common.c
+++ b/gcc/config/arm/aarch-common.c
@@ -557,11 +557,17 @@ arm_md_asm_adjust (vec<rtx> &outputs, vec<rtx> &/*inputs*/,
 #define C(X, Y)  (unsigned char)(X) * 256 + (unsigned char)(Y)
 
       /* All of the condition codes are two characters.  */
-      if (con[0] != 0 && con[1] != 0 && con[2] == 0)
+      if (con[0] != 0 && (con[1] == 0 || con[2] == 0))
 	con01 = C(con[0], con[1]);
 
       switch (con01)
 	{
+	case C('C', 0):
+	  saw_asm_flag = true;
+	  constraints[i] = "=X";
+	  outputs[i] = gen_rtx_SCRATCH (SImode);
+	  continue;
+
 	case C('c', 'c'):
 	case C('l', 'o'):
 	  mode = CC_Cmode, code = GEU;
diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c
index a15807d..56e086a 100644
--- a/gcc/config/i386/i386.c
+++ b/gcc/config/i386/i386.c
@@ -21021,6 +21021,12 @@ ix86_md_asm_adjust (vec<rtx> &outputs, vec<rtx> &/*inputs*/,
 
       switch (con[0])
 	{
+	case 'C':
+	  saw_asm_flag = true;
+	  constraints[i] = "=X";
+	  outputs[i] = gen_rtx_SCRATCH (SImode);
+	  continue;
+
 	case 'a':
 	  if (con[1] == 0)
 	    mode = CCAmode, code = EQ;
-- 
Alexandre Oliva, happy hacker                https://FSFLA.org/blogs/lxo/
Free Software Activist                       GNU Toolchain Engineer