On Mon, Jul 11, 2022, H.J. Lu <hjl.to...@gmail.com> wrote:
> On Sun, Jul 10, 2022 at 2:38 PM Roger Sayle <ro...@nextmovesoftware.com>
> wrote:
> > Hi HJ,
> >
> > I believe this should now be handled by the post-reload (CSE) pass.
> > Consider the simple test case:
> >
> > __int128 a, b, c;
> > void foo()
> > {
> >   a = 0;
> >   b = 0;
> >   c = 0;
> > }
> >
> > Without any STV, i.e. -O2 -msse4 -mno-stv, GCC implements the TImode writes as pairs of DImode stores:
> >         movq    $0, a(%rip)
> >         movq    $0, a+8(%rip)
> >         movq    $0, b(%rip)
> >         movq    $0, b+8(%rip)
> >         movq    $0, c(%rip)
> >         movq    $0, c+8(%rip)
> >         ret
> >
> > But with STV, i.e. -O2 -msse4, things get converted to V1TI mode:
> >         pxor    %xmm0, %xmm0
> >         movaps  %xmm0, a(%rip)
> >         movaps  %xmm0, b(%rip)
> >         movaps  %xmm0, c(%rip)
> >         ret
> >
> > You're quite right that internally STV actually generates the equivalent of:
> >         pxor    %xmm0, %xmm0
> >         movaps  %xmm0, a(%rip)
> >         pxor    %xmm0, %xmm0
> >         movaps  %xmm0, b(%rip)
> >         pxor    %xmm0, %xmm0
> >         movaps  %xmm0, c(%rip)
> >         ret
> >
> > And currently, because STV runs before cse2 and combine, the const0_rtx
> > gets CSE'd by the cse2 pass to produce the code we see.  However, if
> > you specify -fno-rerun-cse-after-loop (to disable the cse2 pass),
> > you'll see we continue to generate the same optimized code, as the
> > same const0_rtx gets CSE'd in postreload.
> >
> > I can't be certain until I try the experiment, but I believe that the
> > postreload CSE will clean up all of the same common subexpressions.
> > Hence, it should be safe to perform all STV at the same point (after
> > combine), which allows for a few additional optimizations.
> >
> > Does this make sense?  Do you have a test case where
> > -fno-rerun-cse-after-loop produces different/inferior code for TImode
> > STV chains?
> >
> > My guess is that the RTL passes have changed so much in the last six
> > or seven years, that some of the original motivation no longer applies.
> > Certainly we now try to keep TI mode operations visible longer, and
> > then allow STV to behave like a pre-reload pass to decide which set of
> > registers to use (vector V1TI or scalar doubleword DI).  Any CSE
> > opportunities that cse2 finds with V1TI mode, could/should equally
> > well be found for TI mode (mostly).
> 
> You are probably right.  If there are no regressions in the GCC testsuite,
> my original motivation is no longer valid.

It was good to try the experiment, but H.J. is right: there is still some
benefit (as well as some disadvantages) to running STV lowering before
cse2/combine.  A clean-up patch to perform all STV conversion as a single
pass (removing a pass from the compiler) results in just a single regression
in the test suite:
FAIL: gcc.target/i386/pr70155-17.c scan-assembler-times movv1ti_internal 8
which looks like:

__int128 a, b, c, d, e, f;
void foo (void)
{
  a = 0;
  b = -1;
  c = 0;
  d = -1;
  e = 0;
  f = -1;
}

By performing STV after combine (without CSE), reload prefers to implement
this function using a single register, which then requires 12 instructions
rather than 8 (if using two registers).  Alas, there's nothing that
postreload CSE/GCSE can do.  Doh!

        pxor    %xmm0, %xmm0
        movaps  %xmm0, a(%rip)
        pcmpeqd %xmm0, %xmm0
        movaps  %xmm0, b(%rip)
        pxor    %xmm0, %xmm0
        movaps  %xmm0, c(%rip)
        pcmpeqd %xmm0, %xmm0
        movaps  %xmm0, d(%rip)
        pxor    %xmm0, %xmm0
        movaps  %xmm0, e(%rip)
        pcmpeqd %xmm0, %xmm0
        movaps  %xmm0, f(%rip)
        ret
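[For reference, the eight-instruction form that the scan-assembler-times test
expects — my reconstruction, not current compiler output — keeps the zero and
all-ones constants live in two separate registers:]

        pxor    %xmm0, %xmm0
        pcmpeqd %xmm1, %xmm1
        movaps  %xmm0, a(%rip)
        movaps  %xmm1, b(%rip)
        movaps  %xmm0, c(%rip)
        movaps  %xmm1, d(%rip)
        movaps  %xmm0, e(%rip)
        movaps  %xmm1, f(%rip)
        ret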

I also note that even without STV, the scalar implementation of this
function, when compiled with -Os, is also larger than it needs to be due to
poor CSE (notice in the following we only need a single zero register, and
an all-ones register would be helpful).

        xorl    %eax, %eax
        xorl    %edx, %edx
        xorl    %ecx, %ecx
        movq    $-1, b(%rip)
        movq    %rax, a(%rip)
        movq    %rax, a+8(%rip)
        movq    $-1, b+8(%rip)
        movq    %rdx, c(%rip)
        movq    %rdx, c+8(%rip)
        movq    $-1, d(%rip)
        movq    $-1, d+8(%rip)
        movq    %rcx, e(%rip)
        movq    %rcx, e+8(%rip)
        movq    $-1, f(%rip)
        movq    $-1, f+8(%rip)
        ret
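[Something like the following — my sketch, not compiler output — is what
better CSE could give at -Os: one zero register and one all-ones register
shared by all six stores, avoiding the eight-byte movq $-1 immediates:]

        xorl    %eax, %eax
        movq    $-1, %rdx
        movq    %rax, a(%rip)
        movq    %rax, a+8(%rip)
        movq    %rdx, b(%rip)
        movq    %rdx, b+8(%rip)
        movq    %rax, c(%rip)
        movq    %rax, c+8(%rip)
        movq    %rdx, d(%rip)
        movq    %rdx, d+8(%rip)
        movq    %rax, e(%rip)
        movq    %rax, e+8(%rip)
        movq    %rdx, f(%rip)
        movq    %rdx, f+8(%rip)
        ret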

I need to give the problem some more thought.  It would be good to
clean up/unify the STV passes, but I/we need to solve/CSE HJ's last test
case before we do.  Perhaps forbidding "(set (mem:TI) (const_int 0))" in
movti_internal would force the zero register to become visible, and CSE'd,
benefiting both vector code and scalar -Os code; postreload/peephole2 could
then fix up the remaining scalar cases.  It's tricky.
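[Roughly, in RTL terms — my sketch of the idea, not an implemented patch —
forbidding the constant store would split each assignment so the constant
load becomes a CSE-able insn of its own:]

        ;; rather than accepting:
        ;;   (set (mem:TI ...) (const_int 0))
        ;; movti_internal would require a register source:
        ;;   (set (reg:TI tmp) (const_int 0))
        ;;   (set (mem:TI ...) (reg:TI tmp))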

Cheers,
Roger
--

