On Tue, Jun 6, 2023 at 11:00 PM Roger Sayle <ro...@nextmovesoftware.com> wrote:
>
>
> Hi Uros,
> Might you be willing to approve the patch without the *x86_clc pieces?
> These can be submitted later, when they are actually used.  For now,
> we're arguing about the performance of a pattern that's not yet
> generated, on an obsolete microarchitecture that's no longer in use,
> and this is holding up real improvements on current processors.
> cmc, for example, should allow for better cmov if-conversion.

Then the patch is OK; the above can be implemented in an incremental patch.

Thanks,
Uros.

>
> Thanks in advance.
> Roger
> --
>
> -----Original Message-----
> From: Uros Bizjak <ubiz...@gmail.com>
> Sent: 06 June 2023 18:34
> To: Roger Sayle <ro...@nextmovesoftware.com>
> Cc: gcc-patches@gcc.gnu.org
> Subject: Re: [x86 PATCH] Add support for stc, clc and cmc instructions in 
> i386.md
>
> On Tue, Jun 6, 2023 at 5:14 PM Roger Sayle <ro...@nextmovesoftware.com> wrote:
> >
> >
> > Hi Uros,
> > This revision implements your suggestions/refinements. (i) Avoid the
> > UNSPEC_CMC by using the canonical RTL idiom for *x86_cmc, (ii) Use
> > peephole2s to convert x86_stc and *x86_cmc into alternate forms on
> > TARGET_SLOW_STC CPUs (pentium4), when a suitable QImode register is
> > available, (iii) Prefer the addqi_cconly_overflow idiom (addb $-1,%al)
> > over negqi_ccc_1 (neg %al) for setting the carry from a QImode value,
> > (iv) Use andl %eax,%eax to clear the carry flag without requiring
> > (clobbering) an additional register, as an alternate output template for 
> > *x86_clc.
>
> Uh, I don't think (iv) is OK. "xor reg,reg" will break the dependency
> chain, while "and reg,reg" won't. So you are hurting out-of-order
> execution by depending on the instruction that calculated the previous
> result in reg. You can use the peephole2 trick to allocate an unused reg
> here, but then using AND is no better than using XOR, and the latter is
> guaranteed to break dependency chains.
>
> Uros.
>
> > These changes required two minor edits to i386.cc:  ix86_cc_mode had
> > to be tweaked to suggest CCCmode for the new *x86_cmc pattern, and
> > *x86_cmc needed to be handled/parameterized in ix86_rtx_costs so that
> > combine would appreciate that this complex RTL expression was actually
> > a fast, single-byte instruction [i.e., preferable].
> >
> > This patch has been tested on x86_64-pc-linux-gnu with make bootstrap
> > and make -k check, both with and without --target_board=unix{-m32}
> > with no new failures.  Ok for mainline?
> >
> > 2023-06-06  Roger Sayle  <ro...@nextmovesoftware.com>
> >             Uros Bizjak  <ubiz...@gmail.com>
> >
> > gcc/ChangeLog
> >         * config/i386/i386-expand.cc (ix86_expand_builtin) <handlecarry>:
> >         Use new x86_stc instruction when the carry flag must be set.
> >         * config/i386/i386.cc (ix86_cc_mode): Use CCCmode for *x86_cmc.
> >         (ix86_rtx_costs): Provide accurate rtx_costs for *x86_cmc.
> >         * config/i386/i386.h (TARGET_SLOW_STC): New define.
> >         * config/i386/i386.md (UNSPEC_CLC): New UNSPEC for clc.
> >         (UNSPEC_STC): New UNSPEC for stc.
> >         (*x86_clc): New define_insn (with implementation for pentium4).
> >         (x86_stc): New define_insn.
> >         (define_peephole2): Convert x86_stc into alternate implementation
> >         on pentium4 without -Os when a QImode register is available.
> >         (*x86_cmc): New define_insn.
> >         (define_peephole2): Convert *x86_cmc into alternate implementation
> >         on pentium4 without -Os when a QImode register is available.
> >         (*setccc): New define_insn_and_split for a no-op CCCmode move.
> >         (*setcc_qi_negqi_ccc_1_<mode>): New define_insn_and_split to
> >         recognize (and eliminate) the carry flag being copied to itself.
> >         (*setcc_qi_negqi_ccc_2_<mode>): Likewise.
> >         * config/i386/x86-tune.def (X86_TUNE_SLOW_STC): New tuning flag.
> >
> > gcc/testsuite/ChangeLog
> >         * gcc.target/i386/cmc-1.c: New test case.
> >         * gcc.target/i386/stc-1.c: Likewise.
> >
> >
> > Thanks, Roger.
> > --
> >
> > -----Original Message-----
> > From: Uros Bizjak <ubiz...@gmail.com>
> > Sent: 04 June 2023 18:53
> > To: Roger Sayle <ro...@nextmovesoftware.com>
> > Cc: gcc-patches@gcc.gnu.org
> > Subject: Re: [x86 PATCH] Add support for stc, clc and cmc instructions
> > in i386.md
> >
> > On Sun, Jun 4, 2023 at 12:45 AM Roger Sayle <ro...@nextmovesoftware.com> 
> > wrote:
> > >
> > >
> > > This patch is the latest revision of my patch to add support for the
> > > STC (set carry flag), CLC (clear carry flag) and CMC (complement
> > > carry
> > > flag) instructions to the i386 backend, incorporating Uros'
> > > previous feedback.  The significant changes are (i) the inclusion of
> > > CMC, (ii) the use of an UNSPEC for the pattern, (iii) the use of a new
> > > X86_TUNE_SLOW_STC tuning flag to select alternate implementations on
> > > pentium4 (which has a notoriously slow STC) when not optimizing for
> > > size.
> > >
> > > An example of the use of the stc instruction is:
> > > unsigned int foo (unsigned int a, unsigned int b, unsigned int *c)
> > > {
> > >   return __builtin_ia32_addcarryx_u32 (1, a, b, c);
> > > }
> > >
> > > which previously generated:
> > >         movl    $1, %eax
> > >         addb    $-1, %al
> > >         adcl    %esi, %edi
> > >         setc    %al
> > >         movl    %edi, (%rdx)
> > >         movzbl  %al, %eax
> > >         ret
> > >
> > > with this patch now generates:
> > >         stc
> > >         adcl    %esi, %edi
> > >         setc    %al
> > >         movl    %edi, (%rdx)
> > >         movzbl  %al, %eax
> > >         ret
> > >
> > > An example of the use of the cmc instruction (where the carry from a
> > > first adc is inverted/complemented as input to a second adc) is:
> > > unsigned int o1, o2;
> > >
> > > unsigned int bar (unsigned int a, unsigned int b,
> > >                   unsigned int c, unsigned int d)
> > > {
> > >   unsigned int c1 = __builtin_ia32_addcarryx_u32 (1, a, b, &o1);
> > >   return __builtin_ia32_addcarryx_u32 (c1 ^ 1, c, d, &o2);
> > > }
> > >
> > > which previously generated:
> > >         movl    $1, %eax
> > >         addb    $-1, %al
> > >         adcl    %esi, %edi
> > >         setnc   %al
> > >         movl    %edi, o1(%rip)
> > >         addb    $-1, %al
> > >         adcl    %ecx, %edx
> > >         setc    %al
> > >         movl    %edx, o2(%rip)
> > >         movzbl  %al, %eax
> > >         ret
> > >
> > > and now generates:
> > >         stc
> > >         adcl    %esi, %edi
> > >         cmc
> > >         movl    %edi, o1(%rip)
> > >         adcl    %ecx, %edx
> > >         setc    %al
> > >         movl    %edx, o2(%rip)
> > >         movzbl  %al, %eax
> > >         ret
> > >
> > >
> > > This patch has been tested on x86_64-pc-linux-gnu with make
> > > bootstrap and make -k check, both with and without
> > > --target_board=unix{-m32} with no new failures.  Ok for mainline?
> > >
> > >
> > > 2023-06-03  Roger Sayle  <ro...@nextmovesoftware.com>
> > >
> > > gcc/ChangeLog
> > >         * config/i386/i386-expand.cc (ix86_expand_builtin) <handlecarry>:
> > >         Use new x86_stc or negqi_ccc_1 instructions to set the carry flag.
> > >         * config/i386/i386.h (TARGET_SLOW_STC): New define.
> > >         * config/i386/i386.md (UNSPEC_CLC): New UNSPEC for clc.
> > >         (UNSPEC_STC): New UNSPEC for stc.
> > >         (UNSPEC_CMC): New UNSPEC for cmc.
> > >         (*x86_clc): New define_insn.
> > >         (*x86_clc_xor): New define_insn for pentium4 without -Os.
> > >         (x86_stc): New define_insn.
> > >         (define_split): Convert x86_stc into alternate implementation
> > >         on pentium4.
> > >         (x86_cmc): New define_insn.
> > >         (*x86_cmc_1): New define_insn_and_split to recognize cmc pattern.
> > >         (*setcc_qi_negqi_ccc_1_<mode>): New define_insn_and_split to
> > >         recognize (and eliminate) the carry flag being copied to itself.
> > >         (*setcc_qi_negqi_ccc_2_<mode>): Likewise.
> > >         (neg<mode>_ccc_1): Renamed from *neg<mode>_ccc_1 for gen function.
> > >         * config/i386/x86-tune.def (X86_TUNE_SLOW_STC): New tuning flag.
> > >
> > > gcc/testsuite/ChangeLog
> > >         * gcc.target/i386/cmc-1.c: New test case.
> > >         * gcc.target/i386/stc-1.c: Likewise.
> >
> > +;; Clear carry flag.
> > +(define_insn "*x86_clc"
> > +  [(set (reg:CCC FLAGS_REG) (unspec:CCC [(const_int 0)] UNSPEC_CLC))]
> > +  "!TARGET_SLOW_STC || optimize_function_for_size_p (cfun)"
> > +  "clc"
> > +  [(set_attr "length" "1")
> > +   (set_attr "length_immediate" "0")
> > +   (set_attr "modrm" "0")])
> > +
> > +(define_insn "*x86_clc_xor"
> > +  [(set (reg:CCC FLAGS_REG) (unspec:CCC [(const_int 0)] UNSPEC_CLC))
> > +   (clobber (match_scratch:SI 0 "=r"))]
> > +  "TARGET_SLOW_STC && !optimize_function_for_size_p (cfun)"
> > +  "xor{l}\t%0, %0"
> > +  [(set_attr "type" "alu1")
> > +   (set_attr "mode" "SI")
> > +   (set_attr "length_immediate" "0")])
> >
> > I think the above would be better implemented as a peephole2 pattern that 
> > triggers when a register is available. We should not waste a register on a 
> > register-starved x86_32 just to clear the carry flag. This is implemented with:
> >
> >   [(match_scratch:SI 2 "r")
> >
> > at the beginning of the peephole2 pattern that generates x86_clc_xor.
> > The pattern should be constrained with "TARGET_SLOW_STC && 
> > !optimize_function_for_size_p()" and x86_clc_xor should be available only 
> > after reload (like e.g. "*mov<mode>_xor").
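> >
> > Something like this untested sketch is what I have in mind (operand
> > numbering and the exact conditions would need to be double-checked):
> >
> > (define_peephole2
> >   [(match_scratch:SI 0 "r")
> >    (set (reg:CCC FLAGS_REG) (unspec:CCC [(const_int 0)] UNSPEC_CLC))]
> >   "TARGET_SLOW_STC && !optimize_function_for_size_p (cfun)"
> >   [(parallel [(set (reg:CCC FLAGS_REG)
> >                    (unspec:CCC [(const_int 0)] UNSPEC_CLC))
> >               (clobber (match_dup 0))])])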
> >
> > +;; On Pentium 4, set the carry flag using mov $1,%al;neg %al.
> > +(define_split
> > +  [(set (reg:CCC FLAGS_REG) (unspec:CCC [(const_int 0)] UNSPEC_STC))]
> > +  "TARGET_SLOW_STC
> > +   && !optimize_insn_for_size_p ()
> > +   && can_create_pseudo_p ()"
> > +  [(set (match_dup 0) (const_int 1))
> > +   (parallel
> > +     [(set (reg:CCC FLAGS_REG)
> > +       (unspec:CCC [(match_dup 0) (const_int 0)] UNSPEC_CC_NE))
> > +      (set (match_dup 0) (neg:QI (match_dup 0)))])]
> > +  "operands[0] = gen_reg_rtx (QImode);")
> >
> > Same here, this should be a peephole2 to trigger only when a register is 
> > available, again for "TARGET_SLOW_STC && !optimize_function_for_size_p()".
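> >
> > Again only as an untested sketch, the stc variant could look something
> > like:
> >
> > (define_peephole2
> >   [(match_scratch:QI 0 "r")
> >    (set (reg:CCC FLAGS_REG) (unspec:CCC [(const_int 0)] UNSPEC_STC))]
> >   "TARGET_SLOW_STC && !optimize_function_for_size_p (cfun)"
> >   [(set (match_dup 0) (const_int 1))
> >    (parallel [(set (reg:CCC FLAGS_REG)
> >                    (unspec:CCC [(match_dup 0) (const_int 0)] UNSPEC_CC_NE))
> >               (set (match_dup 0) (neg:QI (match_dup 0)))])])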
> >
> > Is the new sequence "mov $1,%al; neg %al" any better than the existing
> > "mov $1,%al; addb $-1,%al", or is there another reason for the changed
> > sequence?
> >
> > +(define_insn_and_split "*x86_cmc_1"
> > +  [(set (reg:CCC FLAGS_REG)
> > +        (unspec:CCC [(geu:QI (reg:CCC FLAGS_REG) (const_int 0))
> > +             (const_int 0)] UNSPEC_CC_NE))]
> > +  "!TARGET_SLOW_STC || optimize_function_for_size_p (cfun)"
> > +  "#"
> > +  "&& 1"
> > +  [(set (reg:CCC FLAGS_REG) (unspec:CCC [(reg:CCC FLAGS_REG)] UNSPEC_CMC))])
> >
> > The above should be the RTL model of x86_cmc. No need to split to an unspec.
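> >
> > I.e., keep that RTL as the insn pattern and emit cmc directly, roughly
> > (a sketch; attributes as in the other one-byte flag patterns):
> >
> > (define_insn "*x86_cmc"
> >   [(set (reg:CCC FLAGS_REG)
> >         (unspec:CCC [(geu:QI (reg:CCC FLAGS_REG) (const_int 0))
> >                      (const_int 0)] UNSPEC_CC_NE))]
> >   "!TARGET_SLOW_STC || optimize_function_for_size_p (cfun)"
> >   "cmc"
> >   [(set_attr "length" "1")
> >    (set_attr "length_immediate" "0")
> >    (set_attr "modrm" "0")])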
> >
> > Uros.
>