On Mon, Aug 18, 2014 at 9:16 PM, H.J. Lu <hjl.to...@gmail.com> wrote:

>> Attached patch fixes the problem with false data dependency on output
>> register for popcnt, lzcnt and tzcnt insns on sandybridge and haswell
>> targets.
>>
>> The new insn pattern shadows existing one, and after reload, the
>> clearing isns is split out of the insn. This way the clearing insn can
>> be scheduled by postreload scheduler. The new pattern takes care to
>> avoid live registers, so the compiler is always able to clear output
>> reg.
>>
>> The testcase from the PR, compiled with -O3 -march=corei7 improves on
>> Ivybridge from:
>>
>> unsigned        209717360000    3.21002 sec     16.3329 GB/s
>> uint64_t        209717360000    4.06517 sec     12.8971 GB/s
>>
>> to (-O3 -march=corei7 -mtune-ctrl=avoid_false_dep_for_bmi):
>>
>> unsigned        209717360000    3.14541 sec     16.6683 GB/s
>> uint64_t        209717360000    2.3663 sec      22.1564 GB/s
>>
>> Due to high impact, the new tune flag is enabled by default for Intel
>> tunes and generic:
>>
>> m_SANDYBRIDGE | m_HASWELL | m_INTEL | m_GENERIC
>>
>> 2014-08-16  Uros Bizjak  <ubiz...@gmail.com>
>>
>>     PR target/62011
>>     * config/i386/x86-tune.def (X86_TUNE_AVOID_FALSE_DEP_FOR_BMI):
>>     New tune flag.
>>     * config/i386/i386.h (TARGET_AVOID_FALSE_DEP_FOR_BMI): New define.
>>     * config/i386/i386.md (unspec) <UNSPEC_INSN_FALSE_DEP>: New unspec.
>>     (ffs<mode>2): Do not expand with tzcnt for
>>     TARGET_AVOID_FALSE_DEP_FOR_BMI.
>>     (ffssi2_no_cmove): Ditto.
>>     (*tzcnt<mode>_1): Disable for TARGET_AVOID_FALSE_DEP_FOR_BMI.
>>     (ctz<mode>2): New expander.
>>     (*ctz<mode>2_falsedep_1): New insn_and_split pattern.
>>     (*ctz<mode>2_falsedep): New insn.
>>     (*ctz<mode>2): Rename from ctz<mode>2.
>>     (clz<mode>2_lzcnt): New expander.
>>     (*clz<mode>2_lzcnt_falsedep_1): New insn_and_split pattern.
>>     (*clz<mode>2_lzcnt_falsedep): New insn.
>>     (*clz<mode>2): Rename from ctz<mode>2.
>>     (popcount<mode>2): New expander.
>>     (*popcount<mode>2_falsedep_1): New insn_and_split pattern.
>>     (*popcount<mode>2_falsedep): New insn.
>>     (*popcount<mode>2): Rename from ctz<mode>2.
>>     (*popcount<mode>2_cmp): Remove.
>>     (*popcountsi2_cmp_zext): Ditto.
>>
>> The patch was bootstrapped and regression tested on
>> x86_64-pc-linux-gnu {,-m32} and will be committed to mainline SVN
>> after a couple of days. The patch will be also backported to 4.9
>> branch.
>>
>> Uros.
>
> False dependency happens when destination is only updated by tcnt,
> lzcnt or popcnt.  There is no false dependency when destination is
> also used in source.  This patch avoids xor when destination is used

That fact is a (good) news to me.

> in source.  The difference is
>
> @@ -91,15 +91,12 @@ main:
>   .p2align 3
>  .L23:
>   leal 1(%rdx), %ecx
> - xorl %r9d, %r9d
> - xorl %r10d, %r10d
> - popcntq (%rbx,%rax,8), %r10
> - popcntq (%rbx,%rcx,8), %r9
> + popcntq (%rbx,%rax,8), %rax
>   leal 2(%rdx), %r8d
> - movq %r9, %rcx
> + popcntq (%rbx,%rcx,8), %rcx
> + addq %rax, %rcx
>   xorl %eax, %eax
>   leal 3(%rdx), %esi
> - addq %r10, %rcx
>   popcntq (%rbx,%r8,8), %rax
>   addq %rax, %rcx
>   xorl %eax, %eax
>
> and I got
>
> unsigned 41959360000 0.456816 sec 22.954 GB/s
> uint64_t 41959360000 0.408019 sec 25.6992 GB/s
>
> vs
>
> unsigned 41959360000 0.531386 sec 19.7328 GB/s
> uint64_t 41959360000 0.408081 sec 25.6953 GB/s
>
> on Haswell.  OK for trunk?
> 2014-08-18  H.J. Lu  <hongjiu...@intel.com>
>
> * config/i386/i386.md (*ctz<mode>2_falsedep_1): Don't clear
> destination if it is used in source.
> (*clz<mode>2_lzcnt_falsedep_1): Likewise.
> (*popcount<mode>2_falsedep_1): Likewise.

OK with a small nit below, if bootstrapped and regression tested
properly (you didn't state that).

+; False dependency happens when destination is only updated by tcnt,

tzcnt

+; lzcnt or popcnt.  There is no false dependency when destination is

Thanks,
Uros.

Reply via email to