2016-04-28 13:43 GMT+03:00 Uros Bizjak <ubiz...@gmail.com>:
> On Thu, Apr 28, 2016 at 12:36 PM, Ilya Enkovich <enkovich....@gmail.com> wrote:
>> 2016-04-27 22:58 GMT+03:00 Uros Bizjak <ubiz...@gmail.com>:
>>> Hello!
>>>
>>> This RFC patch illustrates the idea of using STV pass to load/store
>>> any TImode constant using SSE insns. The testcase:
>>>
>>> --cut here--
>>> __int128 x;
>>>
>>> __int128 test_1 (void)
>>> {
>>>   x = (__int128) 0x00112233;
>>> }
>>>
>>> __int128 test_2 (void)
>>> {
>>>   x = ((__int128) 0x0011223344556677 << 64);
>>> }
>>>
>>> __int128 test_3 (void)
>>> {
>>>   x = ((__int128) 0x0011223344556677 << 64) + (__int128) 0x0011223344556677;
>>> }
>>> --cut here--
>>>
>>> currently compiles (-O2) on x86_64 to:
>>>
>>> test_1:
>>>         movq    $1122867, x(%rip)
>>>         movq    $0, x+8(%rip)
>>>         ret
>>>
>>> test_2:
>>>         xorl    %eax, %eax
>>>         movabsq $4822678189205111, %rdx
>>>         movq    %rax, x(%rip)
>>>         movq    %rdx, x+8(%rip)
>>>         ret
>>>
>>> test_3:
>>>         movabsq $4822678189205111, %rax
>>>         movabsq $4822678189205111, %rdx
>>>         movq    %rax, x(%rip)
>>>         movq    %rdx, x+8(%rip)
>>>         ret
>>>
>>> However, using the attached patch, we compile all tests to:
>>>
>>> test:
>>>         movdqa  .LC0(%rip), %xmm0
>>>         movaps  %xmm0, x(%rip)
>>>         ret
>>>
>>> Ilya, HJ - do you think new sequences are better, or - as suggested by
>>> Jakub - they are beneficial with STV pass, as we are now able to load
>>> any immediate value? A variant of this patch can also be used to load
>>> DImode values to 32bit STV pass.
>>>
>>> Uros.
>>
>> Hi,
>>
>> Why don't we have two movq instructions in all three cases now? Is it
>> because of late split?
>
> movq can handle only 32bit sign-extended immediates. There is actually
> room for improvement in test_2, where we could directly store 0 to
> x(%rip).

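To make the imm32 point in the quoted paragraph concrete, here is a small
standalone sketch. It is not GCC code; fits_simm32 and movabs_needed are
hypothetical names chosen for illustration. It checks which 64-bit halves
of the three test constants can be stored directly with movq and which
need a movabsq into a scratch register first:

```cpp
#include <cstdint>

// True if v can be encoded as the sign-extended 32-bit immediate of a
// 64-bit movq store.
constexpr bool fits_simm32(std::int64_t v)
{
    return v >= INT32_MIN && v <= INT32_MAX;
}

// Hypothetical helper: how many of the two 64-bit halves of a TImode
// constant are too wide for an imm32 and so need a movabsq.
constexpr int movabs_needed(std::int64_t lo, std::int64_t hi)
{
    return (fits_simm32(lo) ? 0 : 1) + (fits_simm32(hi) ? 0 : 1);
}

// The three constants from the testcase, split into (low, high) halves.
static_assert(movabs_needed(0x00112233, 0) == 0,
              "test_1: both halves fit, two plain movq stores");
static_assert(movabs_needed(0, 0x0011223344556677) == 1,
              "test_2: only the high half needs movabsq");
static_assert(movabs_needed(0x0011223344556677, 0x0011223344556677) == 2,
              "test_3: both halves need movabsq");
```

The counts line up with the movabsq instructions in the -O2 output quoted
above. In test_2 the low half is 0 and fits an imm32, which is why the
xorl could be replaced by a direct store of 0.
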
Right. In this case timode_scalar_chain::compute_convert_gain should
analyze immediate values used in a chain.

Thanks,
Ilya

>
> Uros.
>
>> I wouldn't say SSE load+store is always better than two movq instructions.
>> But it obviously can enable bigger chains for STV which is good. I think
>> you should adjust a cost model to handle immediates for proper decision.
>>
>> That's what I have in my draft for DImode immediates:
>>
>> @@ -3114,6 +3123,20 @@ scalar_chain::build (bitmap candidates, unsigned insn_uid)
>>    BITMAP_FREE (queue);
>>  }
>>
>> +/* Return a cost of building a vector constant
>> +   instead of using a scalar one.  */
>> +
>> +int
>> +scalar_chain::vector_const_cost (rtx exp)
>> +{
>> +  gcc_assert (CONST_INT_P (exp));
>> +
>> +  if (const0_operand (exp, GET_MODE (exp))
>> +      || constm1_operand (exp, GET_MODE (exp)))
>> +    return COSTS_N_INSNS (1);
>> +  return ix86_cost->sse_load[1];
>> +}
>> +
>>  /* Compute a gain for chain conversion.  */
>>
>>  int
>> @@ -3145,11 +3168,25 @@ scalar_chain::compute_convert_gain ()
>>                || GET_CODE (src) == IOR
>>                || GET_CODE (src) == XOR
>>                || GET_CODE (src) == AND)
>> -        gain += ix86_cost->add;
>> +        {
>> +          gain += ix86_cost->add;
>> +          if (CONST_INT_P (XEXP (src, 0)))
>> +            gain -= scalar_chain::vector_const_cost (XEXP (src, 0));
>> +          if (CONST_INT_P (XEXP (src, 1)))
>> +            gain -= scalar_chain::vector_const_cost (XEXP (src, 1));
>> +        }
>>        else if (GET_CODE (src) == COMPARE)
>>          {
>>            /* Assume comparison cost is the same.  */
>>          }
>> +      else if (GET_CODE (src) == CONST_INT)
>> +        {
>> +          if (REG_P (dst))
>> +            gain += COSTS_N_INSNS (2);
>> +          else if (MEM_P (dst))
>> +            gain += 2 * ix86_cost->int_store[2] - ix86_cost->sse_store[1];
>> +          gain -= scalar_chain::vector_const_cost (src);
>> +        }
>>        else
>>          gcc_unreachable ();
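
For anyone who wants to play with the numbers, the CONST_INT part of the
draft above can be modelled outside of GCC roughly like this. It is only a
sketch: the Costs struct, the Dest enum, const_move_gain and the values in
toy are assumptions made up for illustration, not real ix86_cost tuning
data.

```cpp
// Stand-in for the few ix86_cost fields the draft reads (assumption).
struct Costs {
    int insn;       // plays the role of COSTS_N_INSNS (1)
    int int_store;  // plays the role of ix86_cost->int_store[2]
    int sse_load;   // plays the role of ix86_cost->sse_load[1]
    int sse_store;  // plays the role of ix86_cost->sse_store[1]
};

enum class Dest { Reg, Mem };

// Mirrors vector_const_cost: 0 and -1 are charged one instruction,
// any other constant is charged an SSE load.
constexpr int vector_const_cost(Costs c, long long value)
{
    return (value == 0 || value == -1) ? c.insn : c.sse_load;
}

// Mirrors the new CONST_INT branch of compute_convert_gain: what the
// scalar move costs, minus the SSE store (for memory destinations),
// minus the cost of building the vector constant.
constexpr int const_move_gain(Costs c, Dest dst, long long value)
{
    return (dst == Dest::Reg ? 2 * c.insn
                             : 2 * c.int_store - c.sse_store)
           - vector_const_cost(c, value);
}

// Made-up cost values: insn, int_store, sse_load, sse_store.
constexpr Costs toy = {4, 6, 12, 8};

static_assert(const_move_gain(toy, Dest::Reg, 0) == 4,
              "zero into a register: conversion pays off");
static_assert(const_move_gain(toy, Dest::Mem, 0x00112233) == -8,
              "arbitrary constant to memory: conversion loses on its own");
```

With these toy numbers a lone store of an arbitrary constant has negative
gain, so such an insn would only be converted as part of a larger chain
whose other insns make up the difference.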