https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63175

--- Comment #25 from rguenther at suse dot de <rguenther at suse dot de> ---
On Tue, 3 Mar 2015, msebor at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63175
> 
> --- Comment #24 from Martin Sebor <msebor at gcc dot gnu.org> ---
> (In reply to Richard Biener from comment #16)
> > Why is the loop bound to i != 16 / sizeof *s?
> 
> The upper bound is intended to make the copied sequence fit into one vector
> register, irrespective of the size of the array element.
> 
> The vector load and store instructions tolerate unaligned accesses and there
> are permute instructions that combine the contents of two vector registers
> into a single one to compensate for unaligned reads or writes.  I'm not sure it
> makes sense to expect unaligned copies involving a single vector register's
> worth of data to be vectorized (as done in my proposed tests for char and
> short), but I would expect larger unaligned copies (i.e., multiples of 16
> bytes) to benefit from it.  In my experiments I've seen no evidence of GCC
> attempting to vectorize such copies but I need to do some more research to
> understand why.
> 
> (In reply to comment #23)
> 
> The test uses -maltivec and that's what I've been using as well.  But I 
> see in the Power ISA book that lxvw4x and stxvw4x are classified as VSX 
> instructions, so perhaps they shouldn't be emitted without -mvsx.  
> Although 5.0 doesn't emit them even with -mvsx.

5.0 doesn't consider stxvw4x without -mvsx.  With -mvsx it does, but then
the vectorizer cost model says the vectorization is not profitable:

t.c:10:10: note: Cost model analysis:
  Vector inside of basic block cost: 29
  Vector prologue cost: 0
  Vector epilogue cost: 0
  Scalar cost of basic block: 8
t.c:10:10: note: not vectorized: vectorization is not profitable.

I'll see if that cost calculation is sensible.  We have 2 aligned
vector loads (cost 2), one permute (cost 3), one vector stmt (cost 1),
one unaligned store (unknown misalignment) which hits

rs6000_builtin_vectorization_cost (type_of_cost=unaligned_store, 
    vectype=<vector_type 0x7ffff6a39888>, misalign=-1)
    at /space/rguenther/src/svn/trunk2/gcc/config/rs6000/rs6000.c:4376
4376      switch (type_of_cost)
...
4455                        case -1:
4456                          /* Unknown misalignment.  */
4457                        case 4:
4458                        case 12:
4459                          /* Word aligned.  */
4460                          return 23;

a cost of 23 (!?).  For a misalignment of 4?

Well - there you have it.  For the testcase

#define T int
extern const T a [];
T b[8];

void g (void)
{
  const T *p = a + 1;
  T *q = b + 1;

  *q++ = *p++;
  *q++ = *p++;
  *q++ = *p++;
  *q++ = *p++;
}

Possibly 4.8 had the cost model turned off for the testsuite, or
it had bugs and misrepresented the case.  But clearly a cost of
23 looks excessive to me here: the scalar store of one of the 4
elements has cost 1, so the unaligned vector store is nearly
6 times more expensive than doing the 4 unaligned scalar stores.
Nobody would design an instruction with such a severe penalty.

With -fvect-cost-model=unlimited GCC 5 produces

.L.g:
        addis 9,2,.LC0@toc@ha           # gpr load fusion, type long
        ld 9,.LC0@toc@l(9)
        addis 8,2,.LANCHOR0@toc@ha
        addi 8,8,.LANCHOR0@toc@l
        addi 10,9,12
        neg 7,9
        rldicr 10,10,0,59
        rldicr 9,9,0,59
        lvsr 13,0,7
        lxvw4x 33,0,9
        lxvw4x 32,0,10
        li 9,4
        vperm 0,1,0,13
        stxvw4x 32,8,9
        blr

Ah, GCC 4.8 had the cost model disabled by default (at least for
basic-block vectorization), so you need to enable it via
-fvect-cost-model.  Once enabled, it rejects vectorizing the above
with the same reasoning.

So there is no regression; if vectorization is in fact profitable,
the backend needs to adjust its cost model.
