On Tue, Jun 6, 2023 at 11:00 PM Roger Sayle <ro...@nextmovesoftware.com> wrote: > > > Hi Uros, > Might you willing to approve the patch without the *x86_clc pieces? > These can be submitted later, when they are actually used. For now, > we're arguing about the performance of a pattern that's not yet > generated on an obsolete microarchitecture that's no longer in use, > and this is holding up real improvements on current processors. > cmc, for example, should allow for better cmov if-conversion.
Then the patch is OK, the above can be implemented in an incremental patch. Thanks, Uros. > > Thanks in advance. > Roger > -- > > -----Original Message----- > From: Uros Bizjak <ubiz...@gmail.com> > Sent: 06 June 2023 18:34 > To: Roger Sayle <ro...@nextmovesoftware.com> > Cc: gcc-patches@gcc.gnu.org > Subject: Re: [x86 PATCH] Add support for stc, clc and cmc instructions in > i386.md > > On Tue, Jun 6, 2023 at 5:14 PM Roger Sayle <ro...@nextmovesoftware.com> wrote: > > > > > > Hi Uros, > > This revision implements your suggestions/refinements. (i) Avoid the > > UNSPEC_CMC by using the canonical RTL idiom for *x86_cmc, (ii) Use > > peephole2s to convert x86_stc and *x86_cmc into alternate forms on > > TARGET_SLOW_STC CPUs (pentium4), when a suitable QImode register is > > available, (iii) Prefer the addqi_cconly_overflow idiom (addb $-1,%al) > > over negqi_ccc_1 (neg %al) for setting the carry from a QImode value, > > (iv) Use andl %eax,%eax to clear carry flag without requiring > > (clobbering) an additional register, as an alternate output template for > > *x86_clc. > > Uh, I don't think (iv) is OK. "xor reg,reg" will break the dependency chain, > while "and reg,reg" won't. So, you are hurting out-of-order execution by > depending on an instruction that calculates previous result in reg. You can > use peephole2 trick to allocate an unused reg here, but then using AND is no > better than using XOR, and the latter is guaranteed to break dependency > chains. > > Uros. > > > These changes required two minor edits to i386.cc: ix86_cc_mode had > > to be tweaked to suggest CCCmode for the new *x86_cmc pattern, and > > *x86_cmc needed to be handled/parameterized in ix86_rtx_costs so that > > combine would appreciate that this complex RTL expression was actually > > a fast, single byte instruction [i.e. preferable]. > > > > This patch has been tested on x86_64-pc-linux-gnu with make bootstrap > > and make -k check, both with and without --target_board=unix{-m32} > > with no new failures. Ok for mainline? > > > > 2022-06-06 Roger Sayle <ro...@nextmovesoftware.com> > > Uros Bizjak <ubiz...@gmail.com> > > > > gcc/ChangeLog > > * config/i386/i386-expand.cc (ix86_expand_builtin) <handlecarry>: > > Use new x86_stc instruction when the carry flag must be set. > > * config/i386/i386.cc (ix86_cc_mode): Use CCCmode for *x86_cmc. > > (ix86_rtx_costs): Provide accurate rtx_costs for *x86_cmc. > > * config/i386/i386.h (TARGET_SLOW_STC): New define. > > * config/i386/i386.md (UNSPEC_CLC): New UNSPEC for clc. > > (UNSPEC_STC): New UNSPEC for stc. > > (*x86_clc): New define_insn (with implementation for pentium4). > > (x86_stc): New define_insn. > > (define_peephole2): Convert x86_stc into alternate implementation > > on pentium4 without -Os when a QImode register is available. > > (*x86_cmc): New define_insn. > > (define_peephole2): Convert *x86_cmc into alternate implementation > > on pentium4 without -Os when a QImode register is available. > > (*setccc): New define_insn_and_split for a no-op CCCmode move. > > (*setcc_qi_negqi_ccc_1_<mode>): New define_insn_and_split to > > recognize (and eliminate) the carry flag being copied to itself. > > (*setcc_qi_negqi_ccc_2_<mode>): Likewise. > > * config/i386/x86-tune.def (X86_TUNE_SLOW_STC): New tuning flag. > > > > gcc/testsuite/ChangeLog > > * gcc.target/i386/cmc-1.c: New test case. > > * gcc.target/i386/stc-1.c: Likewise. > > > > > > Thanks, Roger. > > -- > > > > -----Original Message----- > > From: Uros Bizjak <ubiz...@gmail.com> > > Sent: 04 June 2023 18:53 > > To: Roger Sayle <ro...@nextmovesoftware.com> > > Cc: gcc-patches@gcc.gnu.org > > Subject: Re: [x86 PATCH] Add support for stc, clc and cmc instructions > > in i386.md > > > > On Sun, Jun 4, 2023 at 12:45 AM Roger Sayle <ro...@nextmovesoftware.com> > > wrote: > > > > > > > > > This patch is the latest revision of my patch to add support for the > > > STC (set carry flag), CLC (clear carry flag) and CMC (complement > > > carry > > > flag) instructions to the i386 backend, incorporating Uros' > > > previous feedback. The significant changes are (i) the inclusion of > > > CMC, (ii) the use of UNSPEC for pattern, (iii) Use of a new > > > X86_TUNE_SLOW_STC tuning flag to use alternate implementations on > > > pentium4 (which has a notoriously slow STC) when not optimizing for > > > size. > > > > > > An example of the use of the stc instruction is: > > > unsigned int foo (unsigned int a, unsigned int b, unsigned int *c) { > > > return __builtin_ia32_addcarryx_u32 (1, a, b, c); } > > > > > > which previously generated: > > > movl $1, %eax > > > addb $-1, %al > > > adcl %esi, %edi > > > setc %al > > > movl %edi, (%rdx) > > > movzbl %al, %eax > > > ret > > > > > > with this patch now generates: > > > stc > > > adcl %esi, %edi > > > setc %al > > > movl %edi, (%rdx) > > > movzbl %al, %eax > > > ret > > > > > > An example of the use of the cmc instruction (where the carry from a > > > first adc is inverted/complemented as input to a second adc) is: > > > unsigned int bar (unsigned int a, unsigned int b, > > > unsigned int c, unsigned int d) { > > > unsigned int c1 = __builtin_ia32_addcarryx_u32 (1, a, b, &o1); > > > return __builtin_ia32_addcarryx_u32 (c1 ^ 1, c, d, &o2); } > > > > > > which previously generated: > > > movl $1, %eax > > > addb $-1, %al > > > adcl %esi, %edi > > > setnc %al > > > movl %edi, o1(%rip) > > > addb $-1, %al > > > adcl %ecx, %edx > > > setc %al > > > movl %edx, o2(%rip) > > > movzbl %al, %eax > > > ret > > > > > > and now generates: > > > stc > > > adcl %esi, %edi > > > cmc > > > movl %edi, o1(%rip) > > > adcl %ecx, %edx > > > setc %al > > > movl %edx, o2(%rip) > > > movzbl %al, %eax > > > ret > > > > > > > > > This patch has been tested on x86_64-pc-linux-gnu with make > > > bootstrap and make -k check, both with and without > > > --target_board=unix{-m32} with no new failures. Ok for mainline? > > > > > > > > > 2022-06-03 Roger Sayle <ro...@nextmovesoftware.com> > > > > > > gcc/ChangeLog > > > * config/i386/i386-expand.cc (ix86_expand_builtin) <handlecarry>: > > > Use new x86_stc or negqi_ccc_1 instructions to set the carry flag. > > > * config/i386/i386.h (TARGET_SLOW_STC): New define. > > > * config/i386/i386.md (UNSPEC_CLC): New UNSPEC for clc. > > > (UNSPEC_STC): New UNSPEC for stc. > > > (UNSPEC_CMC): New UNSPEC for cmc. > > > (*x86_clc): New define_insn. > > > (*x86_clc_xor): New define_insn for pentium4 without -Os. > > > (x86_stc): New define_insn. > > > (define_split): Convert x86_stc into alternate implementation > > > on pentium4. > > > (x86_cmc): New define_insn. > > > (*x86_cmc_1): New define_insn_and_split to recognize cmc pattern. > > > (*setcc_qi_negqi_ccc_1_<mode>): New define_insn_and_split to > > > recognize (and eliminate) the carry flag being copied to itself. > > > (*setcc_qi_negqi_ccc_2_<mode>): Likewise. > > > (neg<mode>_ccc_1): Renamed from *neg<mode>_ccc_1 for gen function. > > > * config/i386/x86-tune.def (X86_TUNE_SLOW_STC): New tuning flag. > > > > > > gcc/testsuite/ChangeLog > > > * gcc.target/i386/cmc-1.c: New test case. > > > * gcc.target/i386/stc-1.c: Likewise. > > > > +;; Clear carry flag. > > +(define_insn "*x86_clc" > > + [(set (reg:CCC FLAGS_REG) (unspec:CCC [(const_int 0)] UNSPEC_CLC))] > > + "!TARGET_SLOW_STC || optimize_function_for_size_p (cfun)" > > + "clc" > > + [(set_attr "length" "1") > > + (set_attr "length_immediate" "0") > > + (set_attr "modrm" "0")]) > > + > > +(define_insn "*x86_clc_xor" > > + [(set (reg:CCC FLAGS_REG) (unspec:CCC [(const_int 0)] UNSPEC_CLC)) > > + (clobber (match_scratch:SI 0 "=r"))] > > + "TARGET_SLOW_STC && !optimize_function_for_size_p (cfun)" > > + "xor{l}\t%0, %0" > > + [(set_attr "type" "alu1") > > + (set_attr "mode" "SI") > > + (set_attr "length_immediate" "0")]) > > > > I think the above would be better implemented as a peephole2 pattern that > > triggers when a register is available. We should not waste a register on a > > register starved x86_32 just to set a carry flag. This is implemented with: > > > > [(match_scratch:SI 2 "r") > > > > at the beginning of the peephole2 pattern that generates x86_clc_xor. > > The pattern should be constrained with "TARGET_SLOW_STC && > > !optimize_function_for_size_p()" and x86_clc_xor should be available only > > after reload (like e.g. "*mov<mode>_xor"). > > > > +;; On Pentium 4, set the carry flag using mov $1,%al;neg %al. > > +(define_split > > + [(set (reg:CCC FLAGS_REG) (unspec:CCC [(const_int 0)] UNSPEC_STC))] > > + "TARGET_SLOW_STC > > + && !optimize_insn_for_size_p () > > + && can_create_pseudo_p ()" > > + [(set (match_dup 0) (const_int 1)) > > + (parallel > > + [(set (reg:CCC FLAGS_REG) > > + (unspec:CCC [(match_dup 0) (const_int 0)] UNSPEC_CC_NE)) > > + (set (match_dup 0) (neg:QI (match_dup 0)))])] > > + "operands[0] = gen_reg_rtx (QImode);") > > > > Same here, this should be a peephole2 to trigger only when a register is > > available, again for "TARGET_SLOW_STC && !optimize_function_for_size_p()". > > > > Is the new sequence "mov $1,%al; neg %al" any better than existing "mov > > $1,%al; addb $-1,%al" or is there another reason for the changed sequence? > > > > +(define_insn_and_split "*x86_cmc_1" > > + [(set (reg:CCC FLAGS_REG) > > + (unspec:CCC [(geu:QI (reg:CCC FLAGS_REG) (const_int 0)) > > + (const_int 0)] UNSPEC_CC_NE))] > > + "!TARGET_SLOW_STC || optimize_function_for_size_p (cfun)" > > + "#" > > + "&& 1" > > + [(set (reg:CCC FLAGS_REG) (unspec:CCC [(reg:CCC FLAGS_REG)] > > +UNSPEC_CMC))]) > > > > The above should be the RTL model of x86_cmc. No need to split to an unspec. > > > > Uros. >