https://gcc.gnu.org/bugzilla/show_bug.cgi?id=120941
--- Comment #18 from H.J. Lu <hjl.tools at gmail dot com> ---
(In reply to Filip Kastl from comment #17)
> This is the replacement that causes the slowdown (well, two replacements):
>
> ----------------------
> Replace:
>
> (insn 2224 2222 2228 20 (set (reg:V4DF 1604)
>         (vec_duplicate:V4DF (mem/u/c:DF (symbol_ref/u:DI ("*.LC3") [flags 0x2]) [0 S8 A64]))) 9260 {vec_dupv4df}
>      (expr_list:REG_EQUAL (const_vector:V4DF [
>                 (const_double:DF 2.7777777777777776235801354687282582744956016540527344e-2 [0x0.e38e38e38e38ep-5]) repeated x4
>             ])
>         (nil)))
>
> with:
>
> (insn 2224 2222 2228 20 (set (reg:V4DF 1604)
>         (reg:V4DF 1655)) 2428 {movv4df_internal}
>      (expr_list:REG_EQUAL (const_vector:V4DF [
>                 (const_double:DF 2.7777777777777776235801354687282582744956016540527344e-2 [0x0.e38e38e38e38ep-5]) repeated x4
>             ])
>         (nil)))
>
> deferring rescan insn with uid = 2224.
>
> Replace:
>
> (insn 2228 2224 377 20 (set (reg:V2DF 1603)
>         (vec_duplicate:V2DF (mem/u/c:DF (symbol_ref/u:DI ("*.LC3") [flags 0x2]) [0 S8 A64]))) 7168 {vec_dupv2df}
>      (expr_list:REG_EQUAL (const_vector:V2DF [
>                 (const_double:DF 2.7777777777777776235801354687282582744956016540527344e-2 [0x0.e38e38e38e38ep-5]) repeated x2
>             ])
>         (nil)))
>
> with:
>
> (insn 2228 2224 377 20 (set (reg:V2DF 1603)
>         (subreg:V2DF (reg:V4DF 1655) 0)) 2429 {movv2df_internal}
>      (expr_list:REG_EQUAL (const_vector:V2DF [
>                 (const_double:DF 2.7777777777777776235801354687282582744956016540527344e-2 [0x0.e38e38e38e38ep-5]) repeated x2
>             ])
>         (nil)))
>
> deferring rescan insn with uid = 2228.
> ----------------------
>
> These instructions are inside function "main". Though, the last RTL debug
> instruction is
>
> (debug_insn 272 271 273 19 (debug_marker) "lbm.c":275:2 discrim 1 -1
>      (nil))
>
> so I expect that function "LBM_performStreamCollideTRT" was inlined into
> main and is the original source of these vector instructions.
>
> Hopefully this helps. If you meant something else by "testcase", do tell me.
>
> What I did in more detail:
>
> I used a custom debug counter. If I set the 9-th call of
> ix86_broadcast_inner() to return null (I adapted what hjl's patch does), I
> get rid of the slowdown.
>
> On r16-1644-gaba3b9d3a48a07 I added the debug counter and did:
>
> /home/fkastl/gcc/inst/bin/gcc -std=gnu99 -m64 -c -o lbm.o -DSPEC -DNDEBUG -DSPEC_AUTO_SUPPRESS_OPENMP -Ofast -march=native -mtune=native -g -flto=32 -fpermissive -std=gnu17 -DSPEC_LP64 lbm.c -fdbg-cnt=foo_counter:1000000000-1000000000
> /home/fkastl/gcc/inst/bin/gcc -std=gnu99 -m64 -c -o main.o -DSPEC -DNDEBUG -DSPEC_AUTO_SUPPRESS_OPENMP -Ofast -march=native -mtune=native -g -flto=32 -fpermissive -std=gnu17 -DSPEC_LP64 main.c -fdbg-cnt=foo_counter:1000000000-1000000000
>
> /home/fkastl/gcc/inst/bin/gcc -std=gnu99 -m64 -Wl,-rpath,/home/fkastl/gcc/inst/lib64 -Ofast -march=native -mtune=native -g -flto=32 -fpermissive -std=gnu17 lbm.o main.o -lm -o lbm_r -fdbg-cnt=foo_counter:9-9 -fdump-rtl-all
>
> -> 3m43s
>
> /home/fkastl/gcc/inst/bin/gcc -std=gnu99 -m64 -Wl,-rpath,/home/fkastl/gcc/inst/lib64 -Ofast -march=native -mtune=native -g -flto=32 -fpermissive -std=gnu17 lbm.o main.o -lm -o lbm_r -fdbg-cnt=foo_counter:1000000000-1000000000 -fdump-rtl-all
>
> -> 2m50s
>
> Then I compared the *.rrvl rtl dumps. Btw I had to "backport" the
>
> Replace:
> ...
> with:
>
> and
>
> Add:
> ...
>
> dumping from a newer commit.

Please extract something I can use to reproduce it.
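For readers less used to RTL, the two replacements quoted above can be mirrored in plain C with GCC's generic vector extensions. This is only an illustrative sketch, not the pass itself: the `v4df`/`v2df` typedefs and the function names are hypothetical, and the `memcpy` of the first 16 bytes stands in for `(subreg:V2DF (reg:V4DF 1655) 0)`, i.e. the low 128 bits of the 256-bit register on little-endian x86. It shows that the rewrite preserves the values; it does not model why the rewritten code runs slower.

```c
#include <string.h>

/* Hypothetical vector types standing in for the V4DF and V2DF modes
   (GCC/Clang generic vector extensions).  */
typedef double v4df __attribute__ ((vector_size (32)));
typedef double v2df __attribute__ ((vector_size (16)));

/* Before the rewrite: each mode broadcasts the .LC3 constant from memory
   on its own (vec_dupv4df into reg 1604, vec_dupv2df into reg 1603).  */
static void
before_rewrite (const double *lc3, v4df *r1604, v2df *r1603)
{
  *r1604 = (v4df) { *lc3, *lc3, *lc3, *lc3 };
  *r1603 = (v2df) { *lc3, *lc3 };
}

/* After the rewrite: one V4DF broadcast (reg 1655) is reused.  reg 1604
   becomes a plain register copy, and reg 1603 takes the low 128 bits,
   which is what subreg byte offset 0 selects on little-endian x86.  */
static void
after_rewrite (const double *lc3, v4df *r1604, v2df *r1603)
{
  v4df r1655 = (v4df) { *lc3, *lc3, *lc3, *lc3 };
  *r1604 = r1655;
  memcpy (r1603, &r1655, sizeof *r1603);
}

/* Return nonzero if both versions produce bit-identical results.  */
static int
rewrite_is_equivalent (double c)
{
  v4df a4, b4;
  v2df a2, b2;
  before_rewrite (&c, &a4, &a2);
  after_rewrite (&c, &b4, &b2);
  return memcmp (&a4, &b4, sizeof a4) == 0
         && memcmp (&a2, &b2, sizeof a2) == 0;
}
```

For example, `rewrite_is_equivalent (0x0.e38e38e38e38ep-5)`, called with the hex value of the `.LC3` constant from the dumps, returns nonzero. Values are passed through pointers rather than by value to avoid the ABI differences GCC warns about for 32-byte vectors when AVX is not enabled.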
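The bisection above relies on the closed-interval rule GCC documents for `-fdbg-cnt=name:lower-upper`: `dbg_cnt (name)` returns true only for invocations whose 1-based count lies inside the interval. So `foo_counter:9-9` selects exactly the 9th call, and `1000000000-1000000000` never fires in a realistic build. The following is a minimal standalone simulation of that rule, not GCC's `dbgcnt.cc`. `foo_counter` is the custom counter from the comment; the check inside `ix86_broadcast_inner()` appears only as a comment and is an assumption about what the adapted patch looks like.

```c
#include <stdbool.h>

/* Hypothetical simulation of one debug counter limited to a single
   closed, 1-based interval [lo, hi].  This is NOT GCC's dbgcnt.cc.  */
struct sim_dbg_cnt
{
  unsigned count;  /* invocations seen so far */
  unsigned lo, hi; /* closed interval of invocations that return true */
};

/* Count one invocation and report whether it falls inside [lo, hi].  */
static bool
sim_dbg_cnt (struct sim_dbg_cnt *c)
{
  unsigned v = ++c->count;
  return v >= c->lo && v <= c->hi;
}

/* Assumed shape of the adapted check at the top of ix86_broadcast_inner ():
     if (dbg_cnt (foo_counter))
       return NULL;
   Simulate calls 1 .. call_index and report whether the last of them
   would have bailed out with NULL.  */
static bool
would_bail (unsigned lo, unsigned hi, unsigned call_index)
{
  struct sim_dbg_cnt c = { 0, lo, hi };
  bool last = false;
  for (unsigned i = 1; i <= call_index; i++)
    last = sim_dbg_cnt (&c);
  return last;
}
```

With `lo = hi = 9`, only the 9th simulated call bails out, which matches the "9-th call returns null" experiment; with `lo = hi = 1000000000` no call in a normal compilation is affected.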