On Thu, Apr 28, 2016 at 12:36 PM, Ilya Enkovich <enkovich....@gmail.com> wrote:
> 2016-04-27 22:58 GMT+03:00 Uros Bizjak <ubiz...@gmail.com>:
>> Hello!
>>
>> This RFC patch illustrates the idea of using STV pass to load/store
>> any TImode constant using SSE insns. The testcase:
>>
>> --cut here--
>> __int128 x;
>>
>> __int128 test_1 (void)
>> {
>>   x = (__int128) 0x00112233;
>> }
>>
>> __int128 test_2 (void)
>> {
>>   x = ((__int128) 0x0011223344556677 << 64);
>> }
>>
>> __int128 test_3 (void)
>> {
>>   x = ((__int128) 0x0011223344556677 << 64) + (__int128) 0x0011223344556677;
>> }
>> --cut here--
>>
>> currently compiles (-O2) on x86_64 to:
>>
>> test_1:
>>         movq    $1122867, x(%rip)
>>         movq    $0, x+8(%rip)
>>         ret
>>
>> test_2:
>>         xorl    %eax, %eax
>>         movabsq $4822678189205111, %rdx
>>         movq    %rax, x(%rip)
>>         movq    %rdx, x+8(%rip)
>>         ret
>>
>> test_3:
>>         movabsq $4822678189205111, %rax
>>         movabsq $4822678189205111, %rdx
>>         movq    %rax, x(%rip)
>>         movq    %rdx, x+8(%rip)
>>         ret
>>
>> However, using the attached patch, we compile all tests to:
>>
>> test:
>>         movdqa  .LC0(%rip), %xmm0
>>         movaps  %xmm0, x(%rip)
>>         ret
>>
>> Ilya, HJ - do you think new sequences are better, or - as suggested by
>> Jakub - they are beneficial with STV pass, as we are now able to load
>> any immediate value? A variant of this patch can also be used to load
>> DImode values to 32bit STV pass.
>>
>> Uros.
>
> Hi,
>
> Why don't we have two movq instructions in all three cases now? Is it
> because of late split?

movq can handle only 32bit sign-extended immediates. There is actually
room for improvement in test_2, where we could directly store 0 to
x(%rip).

Uros.

> I wouldn't say SSE load+store is always better than two movq instructions.
> But it obviously can enable bigger chains for STV which is good. I think
> you should adjust a cost model to handle immediates for proper decision.
>
> That's what I have in my draft for DImode immediates:
>
> @@ -3114,6 +3123,20 @@ scalar_chain::build (bitmap candidates, unsigned insn_uid)
>    BITMAP_FREE (queue);
>  }
>
> +/* Return a cost of building a vector costant
> +   instead of using a scalar one. */
> +
> +int
> +scalar_chain::vector_const_cost (rtx exp)
> +{
> +  gcc_assert (CONST_INT_P (exp));
> +
> +  if (const0_operand (exp, GET_MODE (exp))
> +      || constm1_operand (exp, GET_MODE (exp)))
> +    return COSTS_N_INSNS (1);
> +  return ix86_cost->sse_load[1];
> +}
> +
>  /* Compute a gain for chain conversion. */
>
>  int
> @@ -3145,11 +3168,25 @@ scalar_chain::compute_convert_gain ()
>                || GET_CODE (src) == IOR
>                || GET_CODE (src) == XOR
>                || GET_CODE (src) == AND)
> -        gain += ix86_cost->add;
> +        {
> +          gain += ix86_cost->add;
> +          if (CONST_INT_P (XEXP (src, 0)))
> +            gain -= scalar_chain::vector_const_cost (XEXP (src, 0));
> +          if (CONST_INT_P (XEXP (src, 1)))
> +            gain -= scalar_chain::vector_const_cost (XEXP (src, 1));
> +        }
>        else if (GET_CODE (src) == COMPARE)
>          {
>            /* Assume comparison cost is the same. */
>          }
> +      else if (GET_CODE (src) == CONST_INT)
> +        {
> +          if (REG_P (dst))
> +            gain += COSTS_N_INSNS (2);
> +          else if (MEM_P (dst))
> +            gain += 2 * ix86_cost->int_store[2] - ix86_cost->sse_store[1];
> +          gain -= scalar_chain::vector_const_cost (src);
> +        }
>        else
>          gcc_unreachable ();
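
To illustrate the point about movq immediates made earlier in this
message, here is a minimal standalone sketch. It is not GCC code: the
helper names fits_simm32 and halves_needing_movabs are hypothetical, and
it assumes a compiler with the __int128 extension (as the testcase does).
It counts how many 64-bit halves of a 128-bit constant fall outside the
sign-extended 32-bit range that movq accepts as an immediate.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: true if V is representable as a sign-extended
   32-bit immediate, the only immediate form "movq $imm, mem" accepts.  */
static bool
fits_simm32 (int64_t v)
{
  return v >= INT32_MIN && v <= INT32_MAX;
}

/* Hypothetical helper: split a 128-bit constant into two 64-bit halves
   and count the halves that do not fit a sign-extended 32-bit immediate.
   Each such half has to be materialized in a register first (e.g. with
   movabsq) before it can be stored.  The unsigned-to-signed conversions
   below rely on GCC's modular (two's complement) behavior.  */
static int
halves_needing_movabs (unsigned __int128 c)
{
  int64_t lo = (int64_t) (uint64_t) c;
  int64_t hi = (int64_t) (uint64_t) (c >> 64);
  return !fits_simm32 (lo) + !fits_simm32 (hi);
}
```

For the three constants in the testcase this gives 0, 1 and 2, which
matches the number of movabsq instructions in the quoted -O2 output for
test_1, test_2 and test_3.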