On Tue, Jan 28, 2014 at 12:08 PM, Bingfeng Mei <b...@broadcom.com> wrote:
> Thanks, Richard. It is not very clear from documents.
>
> "Signed/Unsigned widening multiplication. The two inputs (operands 1 and 2)
> are vectors with N signed/unsigned elements of size S. Multiply the high/low
> or even/odd elements of the two vectors, and put the N/2 products of size 2*S
> in the output vector (operand 0)."
>
> So I thought that implementing both can help vectorizer to optimize more
> loops.
> Maybe we should improve documents.

Maybe. But my answer was from the top of my head - so better double-check
in the vectorizer sources.

Richard.

> Bingfeng
>
> -----Original Message-----
> From: Richard Biener [mailto:richard.guent...@gmail.com]
> Sent: 28 January 2014 11:02
> To: Bingfeng Mei
> Cc: gcc@gcc.gnu.org
> Subject: Re: VEC_WIDEN_MULT_(LO|HI)_EXPR vs. VEC_WIDEN_MULT_(EVEN|ODD)_EXPR
> in vectorization.
>
> On Wed, Jan 22, 2014 at 1:20 PM, Bingfeng Mei <b...@broadcom.com> wrote:
>> Hi,
>> I noticed there is a regression of 4.8 against ancient 4.5 in vectorization
>> on our port. After a bit investigation, I found following code that prefer
>> even|odd version instead of lo|hi one. This is obviously the case for
>> AltiVec and maybe some other targets. But even|odd (expanding to a series of
>> instructions) versions are less efficient on our target than lo|hi ones.
>> Shouldn't there be a target-specific hook to do the choice instead of
>> hard-coded one here, or utilizing some cost-estimating technique to compare
>> two alternatives?
>
> Hmm, what's the reason for a target to support both? I think the idea
> was that a target only supports either (the more efficient case).
>
> Richard.
>
>>   /* The result of a vectorized widening operation usually requires
>>      two vectors (because the widened results do not fit into one
>>      vector).
>>      The generated vector results would normally be expected to be
>>      generated in the same order as in the original scalar computation,
>>      i.e. if 8 results are generated in each vector iteration, they are
>>      to be organized as follows:
>>             vect1: [res1,res2,res3,res4],
>>             vect2: [res5,res6,res7,res8].
>>
>>      However, in the special case that the result of the widening
>>      operation is used in a reduction computation only, the order doesn't
>>      matter (because when vectorizing a reduction we change the order of
>>      the computation). Some targets can take advantage of this and
>>      generate more efficient code. For example, targets like Altivec,
>>      that support widen_mult using a sequence of {mult_even,mult_odd}
>>      generate the following vectors:
>>             vect1: [res1,res3,res5,res7],
>>             vect2: [res2,res4,res6,res8].
>>
>>      When vectorizing outer-loops, we execute the inner-loop sequentially
>>      (each vectorized inner-loop iteration contributes to VF outer-loop
>>      iterations in parallel). We therefore don't allow to change the
>>      order of the computation in the inner-loop during outer-loop
>>      vectorization. */
>>   /* TODO: Another case in which order doesn't *really* matter is when we
>>      widen and then contract again, e.g. (short)((int)x * y >> 8).
>>      Normally, pack_trunc performs an even/odd permute, whereas the
>>      repack from an even/odd expansion would be an interleave, which
>>      would be significantly simpler for e.g. AVX2. */
>>   /* In any case, in order to avoid duplicating the code below, recurse
>>      on VEC_WIDEN_MULT_EVEN_EXPR. If it succeeds, all the return values
>>      are properly set up for the caller. If we fail, we'll continue with
>>      a VEC_WIDEN_MULT_LO/HI_EXPR check. */
>>   if (vect_loop
>>       && STMT_VINFO_RELEVANT (stmt_info) == vect_used_by_reduction
>>       && !nested_in_vect_loop_p (vect_loop, stmt)
>>       && supportable_widening_operation (VEC_WIDEN_MULT_EVEN_EXPR,
>>                                          stmt, vectype_out, vectype_in,
>>                                          code1, code2, multi_step_cvt,
>>                                          interm_types))
>>     return true;
>>
>>
>> Thanks,
>> Bingfeng Mei
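
For readers following the thread, the ordering difference described in the quoted comment can be illustrated with a scalar emulation. This is only a sketch and not GCC code: the helper names (widen_mult_half, widen_mult_parity, reduce_sum), the 8-element int16_t inputs, and the assumption that "lo" means the first half in memory order (on real targets this depends on endianness) are all made up for illustration.

```c
#include <stdint.h>

#define NELTS 8  /* hypothetical number of narrow elements per input vector */

/* Emulates a LO/HI-style widening multiply: products of the first
   (high == 0) or second (high != 0) half, kept in scalar order.  */
static void
widen_mult_half (const int16_t *a, const int16_t *b, int high, int32_t *out)
{
  int base = high ? NELTS / 2 : 0;
  for (int i = 0; i < NELTS / 2; i++)
    out[i] = (int32_t) a[base + i] * b[base + i];
}

/* Emulates an EVEN/ODD-style widening multiply: products of the
   even-indexed (odd == 0) or odd-indexed (odd == 1) elements.  */
static void
widen_mult_parity (const int16_t *a, const int16_t *b, int odd, int32_t *out)
{
  for (int i = 0; i < NELTS / 2; i++)
    out[i] = (int32_t) a[2 * i + odd] * b[2 * i + odd];
}

/* Sum reduction over both result vectors; the order of the partial
   products does not affect the total.  */
static int32_t
reduce_sum (const int32_t *v1, const int32_t *v2)
{
  int32_t sum = 0;
  for (int i = 0; i < NELTS / 2; i++)
    sum += v1[i] + v2[i];
  return sum;
}
```

With a = {1,...,8} and b = {10,...,80}, the lo/hi pair holds the products in scalar order while the even/odd pair holds them interleaved, yet reduce_sum gives the same total for both pairs. That is the property the quoted condition (vect_used_by_reduction) relies on when it tries the even/odd form first.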