On Fri, Mar 4, 2022 at 11:30 AM Roger Sayle <ro...@nextmovesoftware.com> wrote:
>
>
> This clean-up patch resolves PR testsuite/104732, the failure of the recent
> test gcc.target/i386/pr100711-1.c on 32-bit Solaris/x86.  Rather than just
> tweak the testcase, the proposed approach is to fix the underlying problem
> by removing the "TARGET_STV && TARGET_SSE2" conditionals from the DI mode
> logical operation expanders and pre-reload splitters in i386.md, which as
> I'll show generate inferior code (even a GCC 12 regression) on !TARGET_64BIT
> whenever -mno-stv (such as Solaris) or -msse (but not -msse2).
>
> First a little bit of history.  In the beginning, DImode operations on
> i386 weren't defined by the machine description, and lowered during RTL
> expansion to SI mode operations.  The with PR 65105 in 2015, -mstv was
> added, together with a SWIM1248x mode iterator (later renamed to SWIM1248x)
> together with several *<code>di3_doubleword post-reload splitters that
> made use of register allocation to perform some double word operations
> in 64-but XMM registers.  A short while later in 2016, PR 70322 added
> similar support for one_cmpldi2.  All of this logic was dependent upon
> "!TARGET_64BIT && TARGET_STV && TARGET_SSE2".  With the passing of time,
> these conditions became irrelevant when in 2019, it was decided to split
> these double-word patterns before reload.
> https://gcc.gnu.org/pipermail/gcc-patches/2019-June/523877.html
> https://gcc.gnu.org/pipermail/gcc-patches/2019-October/532236.html
> Hence the current situation, where on most modern CPU architectures
> (where "TARGET_STV && TARGET_SSE2" is true), RTL is expanded with DI
> mode operations, that are then split into two SI mode instructions
> before reload, except on Solaris and other odd cases, where the splitting
> is to two SI mode instructions is done during RTL expansion.  By the
> time compilation reaches register allocation both paths in theory
> produce identical or similar code, so the vestigial legacy/logic would
> appear to be harmless.
>
> Unfortunately, there is one place where this arbitrary choice of how
> to lower DI mode doubleword operations is visible to the middle-end,
> it controls whether the backend appears to have a suitable optab, and
> the presence (or not) of DImode optabs can influence vectorization
> cost models and veclower decisions.
>
> The issue (and code quality regression) can be seen in this test case:
>
> typedef long long v2di __attribute__((vector_size (16)));
> v2di x;
> void foo (long long a)
> {
>     v2di t = {a, a};
>     x = ~t;
> }
>
> which when compiled with "-O2 -m32 -msse -march=pentiumpro" produces:
>
> foo:    subl    $28, %esp
>         movl    %ebx, 16(%esp)
>         movl    32(%esp), %eax
>         movl    %esi, 20(%esp)
>         movl    36(%esp), %edx
>         movl    %edi, 24(%esp)
>         movl    %eax, %esi
>         movl    %eax, %edi
>         movl    %edx, %ebx
>         movl    %edx, %ecx
>         notl    %esi
>         notl    %ebx
>         movl    %esi, (%esp)
>         notl    %edi
>         notl    %ecx
>         movl    %ebx, 4(%esp)
>         movl    20(%esp), %esi
>         movl    %edi, 8(%esp)
>         movl    16(%esp), %ebx
>         movl    %ecx, 12(%esp)
>         movl    24(%esp), %edi
>         movss   8(%esp), %xmm1
>         movss   12(%esp), %xmm2
>         movss   (%esp), %xmm0
>         movss   4(%esp), %xmm3
>         unpcklps        %xmm2, %xmm1
>         unpcklps        %xmm3, %xmm0
>         movlhps %xmm1, %xmm0
>         movaps  %xmm0, x
>         addl    $28, %esp
>         ret
>
>
> Importantly notice the four "notl" instructions.  With this patch:
>
> foo:    subl    $28, %esp
>         movl    32(%esp), %edx
>         movl    36(%esp), %eax
>         notl    %edx
>         movl    %edx, (%esp)
>         notl    %eax
>         movl    %eax, 4(%esp)
>         movl    %edx, 8(%esp)
>         movl    %eax, 12(%esp)
>         movaps  (%esp), %xmm1
>         movaps  %xmm1, x
>         addl    $28, %esp
>         ret
>
> Notice only two "notl" instructions.  Checking with Godbolt.org, GCC
> generated 4 NOTs in GCC 4.x and 5.x, 2 NOTs between GCC 6.x and 9.x,
> and regressed to 4 NOTs since GCC 10.x [which hopefully qualifies
> this clean-up as suitable for stage 4].
>
> Most significantly, this patch allows pr100711-1.c to pass with
> -mno-stv, allowing pandn to be used with V2DImode on Solaris/x86.
> Fingers-crossed this should reduce the number of discrepancies
> that Rainer Orth encounters supporting Solaris/x86.
>
> This patch has been tested on x86_64-pc-linux-gnu with make bootstrap
> and make -k check, with/without --target_board='unix{-m32\ -march=
> cascadelake}' with no new failures.  Ok for mainline?

The idea was to leave decomposition of double-word operations to the
generic middle-end, where simplification and propagation of constants
will be handled in a generic way. However, several releases later,
these simplifications were also introduced to STV-enabeld patterns.
So, if the latter approach is demonstrably better, then I see no
problem to enable it for all targets, not only STV-enabled ones.

>
> 2022-03-04  Roger Sayle  <ro...@nextmovesoftware.com>
>
> gcc/ChangeLog
>         PR testsuite/104732
>         * config/i386/i386.md (SWIM1248s): Include DI mode unconditionally.
>         (*anddi3_doubleword): Remove && TARGET_STV && TARGET_SSE2 condition,
>         i.e. always split on !TARGET_64BIT.
>         (*<any_or>di3_doubleword): Likewise.
>         (*one_cmpldi2_doubleword): Likewise.
>
> gcc/testsuite/ChangeLog
>         PR testsuite/104732
>         * gcc.target/i386/pr104732.c: New test case.

OK with a nit below.

Thanks,
Uros.

-;; Math-dependant integer modes with DImode (enabled for 32bit with STV).
+;; Math-dependant integer modes with DImode.
 (define_mode_iterator SWIM1248s
  [(QI "TARGET_QIMODE_MATH")
  (HI "TARGET_HIMODE_MATH")
- SI (DI "TARGET_64BIT || (TARGET_STV && TARGET_SSE2)")])
+ SI DI])

Please rename this iterator to SWIM1248x, as is the case with other
iterators where DImode is used unconditionally.

Reply via email to