On Mon, Jul 11, 2022, H.J. Lu <hjl.to...@gmail.com> wrote:
> On Sun, Jul 10, 2022 at 2:38 PM Roger Sayle <ro...@nextmovesoftware.com>
> wrote:
> > Hi HJ,
> >
> > I believe this should now be handled by the post-reload (CSE) pass.
> > Consider the simple test case:
> >
> > __int128 a, b, c;
> > void foo()
> > {
> >   a = 0;
> >   b = 0;
> >   c = 0;
> > }
> >
> > Without any STV, i.e. -O2 -msse4 -mno-stv, GCC emits TImode writes:
> >         movq    $0, a(%rip)
> >         movq    $0, a+8(%rip)
> >         movq    $0, b(%rip)
> >         movq    $0, b+8(%rip)
> >         movq    $0, c(%rip)
> >         movq    $0, c+8(%rip)
> >         ret
> >
> > But with STV, i.e. -O2 -msse4, things get converted to V1TImode:
> >         pxor    %xmm0, %xmm0
> >         movaps  %xmm0, a(%rip)
> >         movaps  %xmm0, b(%rip)
> >         movaps  %xmm0, c(%rip)
> >         ret
> >
> > You're quite right that internally STV actually generates the
> > equivalent of:
> >         pxor    %xmm0, %xmm0
> >         movaps  %xmm0, a(%rip)
> >         pxor    %xmm0, %xmm0
> >         movaps  %xmm0, b(%rip)
> >         pxor    %xmm0, %xmm0
> >         movaps  %xmm0, c(%rip)
> >         ret
> >
> > Currently, because STV runs before cse2 and combine, the const0_rtx
> > gets CSE'd by the cse2 pass to produce the code we see. However, if
> > you specify -fno-rerun-cse-after-loop (to disable the cse2 pass),
> > you'll see we continue to generate the same optimized code, as the
> > same const0_rtx gets CSE'd in postreload.
> >
> > I can't be certain until I try the experiment, but I believe that
> > postreload CSE will clean up all of the same common subexpressions.
> > Hence, it should be safe to perform all STV at the same point (after
> > combine), which allows for a few additional optimizations.
> >
> > Does this make sense? Do you have a test case where
> > -fno-rerun-cse-after-loop produces different/inferior code for TImode
> > STV chains?
> >
> > My guess is that the RTL passes have changed so much in the last six
> > or seven years that some of the original motivation no longer applies.
> > Certainly we now try to keep TImode operations visible longer, and
> > then allow STV to behave like a pre-reload pass to decide which set
> > of registers to use (vector V1TI or scalar doubleword DI). Any CSE
> > opportunities that cse2 finds with V1TImode could/should equally well
> > be found for TImode (mostly).
>
> You are probably right. If there are no regressions in the GCC
> testsuite, my original motivation is no longer valid.
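(To make the flag comparisons above concrete: the quoted outputs should
be reproducible with invocations along the lines of "gcc -O2 -msse4
-mno-stv -S", "gcc -O2 -msse4 -S", and "gcc -O2 -msse4
-fno-rerun-cse-after-loop -S" on the __int128 test case, though I'm
reconstructing the exact command lines used.)
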
It was good to try the experiment, but H.J. is right: there is still
some benefit (as well as some disadvantages) to running STV lowering
before cse2/combine. A clean-up patch to perform all STV conversion as
a single pass (removing a pass from the compiler) results in just a
single regression in the testsuite:

FAIL: gcc.target/i386/pr70155-17.c scan-assembler-times movv1ti_internal 8

which looks like:

__int128 a, b, c, d, e, f;
void foo (void)
{
  a = 0;
  b = -1;
  c = 0;
  d = -1;
  e = 0;
  f = -1;
}

By performing STV after combine (without CSE), reload prefers to
implement this function using a single register, which then requires
12 instructions rather than 8 (if using two registers). Alas, there's
nothing that postreload CSE/GCSE can do. Doh!

        pxor    %xmm0, %xmm0
        movaps  %xmm0, a(%rip)
        pcmpeqd %xmm0, %xmm0
        movaps  %xmm0, b(%rip)
        pxor    %xmm0, %xmm0
        movaps  %xmm0, c(%rip)
        pcmpeqd %xmm0, %xmm0
        movaps  %xmm0, d(%rip)
        pxor    %xmm0, %xmm0
        movaps  %xmm0, e(%rip)
        pcmpeqd %xmm0, %xmm0
        movaps  %xmm0, f(%rip)
        ret

I also note that even without STV, the scalar implementation of this
function when compiled with -Os is also larger than it needs to be,
due to poor CSE (notice that in the following we only need a single
zero register, and an all-ones register would be helpful):

        xorl    %eax, %eax
        xorl    %edx, %edx
        xorl    %ecx, %ecx
        movq    $-1, b(%rip)
        movq    %rax, a(%rip)
        movq    %rax, a+8(%rip)
        movq    $-1, b+8(%rip)
        movq    %rdx, c(%rip)
        movq    %rdx, c+8(%rip)
        movq    $-1, d(%rip)
        movq    $-1, d+8(%rip)
        movq    %rcx, e(%rip)
        movq    %rcx, e+8(%rip)
        movq    $-1, f(%rip)
        movq    $-1, f+8(%rip)
        ret

I need to give the problem some more thought. It would be good to
clean up/unify the STV passes, but I/we need to solve/CSE H.J.'s last
test case before we do. Perhaps forbidding "(set (mem:TI) (const_int 0))"
in movti_internal would force the zero register to become visible, and
hence CSE'd, benefiting both the vector code and the scalar -Os code;
postreload/peephole2 could then fix up the remaining scalar cases.
It's tricky.

Cheers,
Roger
--
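
P.S. For reference, the 8-instruction, two-register sequence that I'd
hope reload could produce for the pr70155-17 test case above would look
something like the following (my sketch of the desired output, not
actual GCC output):

        pxor    %xmm0, %xmm0            # one shared zero register
        pcmpeqd %xmm1, %xmm1            # one shared all-ones register
        movaps  %xmm0, a(%rip)
        movaps  %xmm1, b(%rip)
        movaps  %xmm0, c(%rip)
        movaps  %xmm1, d(%rip)
        movaps  %xmm0, e(%rip)
        movaps  %xmm1, f(%rip)
        ret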