https://gcc.gnu.org/bugzilla/show_bug.cgi?id=120941
Filip Kastl <pheeck at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|WAITING                     |NEW

--- Comment #17 from Filip Kastl <pheeck at gcc dot gnu.org> ---
This is the replacement that causes the slowdown (well, two replacements):

----------------------
Replace:

(insn 2224 2222 2228 20 (set (reg:V4DF 1604)
        (vec_duplicate:V4DF (mem/u/c:DF (symbol_ref/u:DI ("*.LC3") [flags 0x2]) [0 S8 A64]))) 9260 {vec_dupv4df}
     (expr_list:REG_EQUAL (const_vector:V4DF [
                (const_double:DF 2.7777777777777776235801354687282582744956016540527344e-2 [0x0.e38e38e38e38ep-5]) repeated x4
            ])
        (nil)))

with:

(insn 2224 2222 2228 20 (set (reg:V4DF 1604)
        (reg:V4DF 1655)) 2428 {movv4df_internal}
     (expr_list:REG_EQUAL (const_vector:V4DF [
                (const_double:DF 2.7777777777777776235801354687282582744956016540527344e-2 [0x0.e38e38e38e38ep-5]) repeated x4
            ])
        (nil)))

deferring rescan insn with uid = 2224.

Replace:

(insn 2228 2224 377 20 (set (reg:V2DF 1603)
        (vec_duplicate:V2DF (mem/u/c:DF (symbol_ref/u:DI ("*.LC3") [flags 0x2]) [0 S8 A64]))) 7168 {vec_dupv2df}
     (expr_list:REG_EQUAL (const_vector:V2DF [
                (const_double:DF 2.7777777777777776235801354687282582744956016540527344e-2 [0x0.e38e38e38e38ep-5]) repeated x2
            ])
        (nil)))

with:

(insn 2228 2224 377 20 (set (reg:V2DF 1603)
        (subreg:V2DF (reg:V4DF 1655) 0)) 2429 {movv2df_internal}
     (expr_list:REG_EQUAL (const_vector:V2DF [
                (const_double:DF 2.7777777777777776235801354687282582744956016540527344e-2 [0x0.e38e38e38e38ep-5]) repeated x2
            ])
        (nil)))

deferring rescan insn with uid = 2228.
----------------------

These instructions are inside function "main".  Though, the last RTL debug
instruction is

(debug_insn 272 271 273 19 (debug_marker) "lbm.c":275:2 discrim 1 -1
     (nil))

so I expect that function "LBM_performStreamCollideTRT" was inlined into main
and is the original source of these vector instructions.

Hopefully this helps.  If you meant something else by "testcase", do tell me.
What I did in more detail:

I used a custom debug counter.  If I set the 9th call of
ix86_broadcast_inner() to return null (I adapted what hjl's patch does), I get
rid of the slowdown.

On r16-1644-gaba3b9d3a48a07 I added the debug counter and did:

/home/fkastl/gcc/inst/bin/gcc -std=gnu99 -m64 -c -o lbm.o -DSPEC -DNDEBUG -DSPEC_AUTO_SUPPRESS_OPENMP -Ofast -march=native -mtune=native -g -flto=32 -fpermissive -std=gnu17 -DSPEC_LP64 lbm.c -fdbg-cnt=foo_counter:1000000000-1000000000

/home/fkastl/gcc/inst/bin/gcc -std=gnu99 -m64 -c -o main.o -DSPEC -DNDEBUG -DSPEC_AUTO_SUPPRESS_OPENMP -Ofast -march=native -mtune=native -g -flto=32 -fpermissive -std=gnu17 -DSPEC_LP64 main.c -fdbg-cnt=foo_counter:1000000000-1000000000

/home/fkastl/gcc/inst/bin/gcc -std=gnu99 -m64 -Wl,-rpath,/home/fkastl/gcc/inst/lib64 -Ofast -march=native -mtune=native -g -flto=32 -fpermissive -std=gnu17 lbm.o main.o -lm -o lbm_r -fdbg-cnt=foo_counter:9-9 -fdump-rtl-all
-> 3m43s

/home/fkastl/gcc/inst/bin/gcc -std=gnu99 -m64 -Wl,-rpath,/home/fkastl/gcc/inst/lib64 -Ofast -march=native -mtune=native -g -flto=32 -fpermissive -std=gnu17 lbm.o main.o -lm -o lbm_r -fdbg-cnt=foo_counter:1000000000-1000000000 -fdump-rtl-all
-> 2m50s

Then I compared the *.rrvl RTL dumps.  Btw I had to "backport" the
Replace: ... with: and Add: ... dumping from a newer commit.
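
For readers unfamiliar with the counter ranges, here is a small Python sketch of which calls a single low-high range such as "foo_counter:9-9" selects. It is only a toy model of the range semantics (1-based, inclusive, as the GCC manual describes for -fdbg-cnt), not GCC's implementation. The helper names parse_range and selected_calls are made up for illustration, and what the patched ix86_broadcast_inner() does for a selected call is not modelled here.

```python
def parse_range(option):
    """Parse one 'name:low-high' value, the form used with -fdbg-cnt= above."""
    name, bounds = option.split(":")
    low, high = (int(x) for x in bounds.split("-"))
    return name, low, high


def selected_calls(option, n_calls):
    """Return the 1-based call numbers that fall inside the counter's range.

    Hypothetical helper: it only models the inclusive range check,
    not anything the compiler does with the result.
    """
    _, low, high = parse_range(option)
    return [n for n in range(1, n_calls + 1) if low <= n <= high]


if __name__ == "__main__":
    # Only the 9th call is inside the range.
    print(selected_calls("foo_counter:9-9", 12))
    # The range starts far beyond any realistic call count, so nothing is selected.
    print(selected_calls("foo_counter:1000000000-1000000000", 12))
```

So the 9-9 link singles out exactly one call, while the 1000000000-1000000000 range is just a way of making sure no call is ever selected.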