Hi,
I noticed a vectorization regression in 4.8 compared with the much older 4.5
on our port. After a bit of investigation, I found the following code, which
prefers the even|odd version over the lo|hi one. This is obviously the right
choice for AltiVec and maybe some other targets. But on our target the
even|odd versions (which expand to a series of instructions) are less
efficient than the lo|hi ones. Shouldn't there be a target-specific hook to
make this choice instead of hard-coding it here, or some cost-estimation
technique to compare the two alternatives?

     /* The result of a vectorized widening operation usually requires
         two vectors (because the widened results do not fit into one vector).
         The generated vector results would normally be expected to be
         generated in the same order as in the original scalar computation,
         i.e. if 8 results are generated in each vector iteration, they are
         to be organized as follows:
                vect1: [res1,res2,res3,res4],
                vect2: [res5,res6,res7,res8].

         However, in the special case that the result of the widening
         operation is used in a reduction computation only, the order doesn't
         matter (because when vectorizing a reduction we change the order of
         the computation).  Some targets can take advantage of this and
         generate more efficient code.  For example, targets like Altivec,
         that support widen_mult using a sequence of {mult_even,mult_odd}
         generate the following vectors:
                vect1: [res1,res3,res5,res7],
                vect2: [res2,res4,res6,res8].

         When vectorizing outer-loops, we execute the inner-loop sequentially
         (each vectorized inner-loop iteration contributes to VF outer-loop
         iterations in parallel).  We therefore don't allow to change the
         order of the computation in the inner-loop during outer-loop
         vectorization.  */
      /* TODO: Another case in which order doesn't *really* matter is when we
         widen and then contract again, e.g. (short)((int)x * y >> 8).
         Normally, pack_trunc performs an even/odd permute, whereas the 
         repack from an even/odd expansion would be an interleave, which
         would be significantly simpler for e.g. AVX2.  */
      /* In any case, in order to avoid duplicating the code below, recurse
         on VEC_WIDEN_MULT_EVEN_EXPR.  If it succeeds, all the return values
         are properly set up for the caller.  If we fail, we'll continue with
         a VEC_WIDEN_MULT_LO/HI_EXPR check.  */
      if (vect_loop
          && STMT_VINFO_RELEVANT (stmt_info) == vect_used_by_reduction
          && !nested_in_vect_loop_p (vect_loop, stmt)
          && supportable_widening_operation (VEC_WIDEN_MULT_EVEN_EXPR,
                                             stmt, vectype_out, vectype_in,
                                             code1, code2, multi_step_cvt,
                                             interm_types))
        return true;
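
To make the question concrete, here is a small standalone sketch in plain C
(not GCC code). It models the two result layouts described in the comment
above, shows that a sum reduction gives the same value for both, and sketches
the kind of cost-based choice I have in mind. Everything named here
(widen_mult_lo_hi, widen_mult_even_odd, reduce_sum, choose_widen_form, and the
cost parameters) is hypothetical and for illustration only; none of it is an
existing GCC hook or API.

```c
#include <stddef.h>
#include <stdint.h>

#define VF 8  /* narrow elements per vector iteration, as in the comment */

/* lo|hi form: vect1 holds products 0..3, vect2 holds products 4..7,
   i.e. the same order as the scalar computation.  */
static void widen_mult_lo_hi(const int16_t *a, const int16_t *b,
                             int32_t *vect1, int32_t *vect2)
{
    for (size_t i = 0; i < VF / 2; i++) {
        vect1[i] = (int32_t)a[i] * b[i];
        vect2[i] = (int32_t)a[i + VF / 2] * b[i + VF / 2];
    }
}

/* even|odd form: vect1 holds the even-indexed products and vect2 the
   odd-indexed ones, so the scalar order is not preserved.  */
static void widen_mult_even_odd(const int16_t *a, const int16_t *b,
                                int32_t *vect1, int32_t *vect2)
{
    for (size_t i = 0; i < VF / 2; i++) {
        vect1[i] = (int32_t)a[2 * i] * b[2 * i];
        vect2[i] = (int32_t)a[2 * i + 1] * b[2 * i + 1];
    }
}

/* A sum reduction over both result vectors; the lane order is irrelevant.  */
static int64_t reduce_sum(const int32_t *vect1, const int32_t *vect2)
{
    int64_t sum = 0;
    for (size_t i = 0; i < VF / 2; i++)
        sum += (int64_t)vect1[i] + vect2[i];
    return sum;
}

/* Hypothetical selection logic: the costs would come from the target
   (for example through a hook); they are plain parameters here.  */
enum widen_form { WIDEN_EVEN_ODD, WIDEN_LO_HI };

static enum widen_form choose_widen_form(int order_matters,
                                         unsigned cost_even_odd,
                                         unsigned cost_lo_hi)
{
    if (order_matters)
        return WIDEN_LO_HI;  /* only lo|hi keeps the scalar order */
    return cost_even_odd < cost_lo_hi ? WIDEN_EVEN_ODD : WIDEN_LO_HI;
}
```

With a = {1,...,8} and b all ones, both forms reduce to the same sum, so
either is valid for a reduction; on a target like ours, where even|odd is the
more expensive sequence, choose_widen_form would fall back to lo|hi.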


Thanks,
Bingfeng Mei
