https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65660

--- Comment #3 from Richard Biener <rguenth at gcc dot gnu.org> ---
Looks like we now vectorize using loop vectorization instead of basic-block
vectorization.  The overhead might be noticeable.  For example

 ./ggSpectrum.h:48:4: note: loop vectorized
-./ggSpectrum.h:49:18: note: basic block vectorized
-./ggSpectrum.h:49:18: note: basic block vectorized
-ggPathDielectricMaterial.cc:36:60: note: basic block vectorized
+./ggSpectrum.h:48:4: note: loop vectorized
+./ggSpectrum.h:48:4: note: loop peeled for vectorization to enhance alignment
+./ggSpectrum.h:48:4: note: loop vectorized
+./ggSpectrum.h:48:4: note: loop peeled for vectorization to enhance alignment
+./ggSpectrum.h:48:4: note: loop vectorized
+./ggSpectrum.h:48:4: note: loop peeled for vectorization to enhance alignment
+./ggSpectrum.h:48:4: note: loop vectorized
+./ggSpectrum.h:48:4: note: loop peeled for vectorization to enhance alignment
+./ggSpectrum.h:48:4: note: loop vectorized
+./ggSpectrum.h:48:4: note: loop peeled for vectorization to enhance alignment

is all from ggPathDielectricMaterial.cc.

Not sure why we peel for alignment at all as bdver2 has
vec_align_load_cost == vec_unalign_load_cost == vec_store_cost (there isn't
any unaligned store cost, but IIRC an unaligned store consumes two store
buffers, thus aligning the stores might be profitable).

Btw, the loop in question is:

    void Set(float d) {
      for (int i = 0; i < nComponents(); i++)
        data[i] = d;
    }

where I can very well imagine that nComponents() is _not_ large enough to
warrant loop vectorization (data is an array of 8 floats).  nComponents()
returns constant 8.
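
To make the trip-count concern concrete, here is a minimal sketch of how 8
float stores split up once a prologue is peeled to reach a 16-byte boundary.
The names PeelSplit and split_for_alignment are hypothetical and this is not
GCC code; it assumes 4-byte floats and a 16-byte vector (vf = 4).

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper, not GCC code: how a loop of `niters` float stores is
// divided when scalar iterations are peeled until the store address reaches
// a `vec_bytes` boundary.
struct PeelSplit {
  int prologue;  // scalar iterations before the vector body
  int vector;    // vector iterations, each covering vf elements
  int epilogue;  // scalar iterations left over after the vector body
};

inline PeelSplit split_for_alignment(std::uintptr_t addr, int niters,
                                     int vec_bytes = 16) {
  const int elem = static_cast<int>(sizeof(float));
  const int vf = vec_bytes / elem;
  assert(vf > 0 && addr % elem == 0);

  // Misalignment of the first store, measured in elements.
  const int misalign = static_cast<int>(addr % vec_bytes) / elem;

  // Scalar iterations needed to reach the next aligned address.
  int prologue = (vf - misalign) % vf;
  if (prologue > niters)
    prologue = niters;

  const int rest = niters - prologue;
  return {prologue, rest / vf, rest % vf};
}
```

With niters = 8, an already aligned address gives two vector iterations and
nothing else, while any misaligned address leaves only one vector iteration
surrounded by scalar prologue and epilogue iterations.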

With bdver2 we now have

t.c:4:20: note: vectorization_factor = 4, niters = 8
t.c:4:20: note: === vect_update_slp_costs_according_to_vf ===
cost model: prologue peel iters set to vf/2.
cost model: epilogue peel iters set to vf/2 because peeling for alignment is
unknown.
t.c:4:20: note: Cost model analysis:
  Vector inside of loop cost: 4
  Vector prologue cost: 8
  Vector epilogue cost: 0
  Scalar iteration cost: 4
  Scalar outside cost: 0
  Vector outside cost: 8
  prologue iterations: 2
  epilogue iterations: 2
  Calculated minimum iters for profitability: 2
t.c:4:20: note:   Runtime profitability threshold = 3
t.c:4:20: note:   Static estimate profitability threshold = 3
t.c:4:20: note: epilog loop required

while generic has

t.c:4:20: note: vectorization_factor = 4, niters = 8
t.c:4:20: note: === vect_update_slp_costs_according_to_vf ===
cost model: prologue peel iters set to vf/2.
cost model: epilogue peel iters set to vf/2 because peeling for alignment is
unknown.
t.c:4:20: note: Cost model analysis:
  Vector inside of loop cost: 1
  Vector prologue cost: 11
  Vector epilogue cost: 2
  Scalar iteration cost: 1
  Scalar outside cost: 0
  Vector outside cost: 13
  prologue iterations: 2
  epilogue iterations: 2
  Calculated minimum iters for profitability: 17
t.c:4:20: note:   Runtime profitability threshold = 16
t.c:4:20: note:   Static estimate profitability threshold = 16
t.c:4:20: note: not vectorized: vectorization not profitable.

somehow the prologue cost looks off for bdver2.
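
The thresholds in the two dumps can be reproduced with a small sketch.  The
formula below is an assumption about what the cost model computes, written
from the dump values rather than copied from GCC; the struct and function
names are hypothetical.  All arithmetic is in integers.

```cpp
#include <algorithm>
#include <cassert>

// Values as printed in the "Cost model analysis" dump.
struct VectCosts {
  int vf;                   // vectorization_factor
  int vec_inside_cost;      // Vector inside of loop cost
  int vec_outside_cost;     // Vector outside cost
  int scalar_iter_cost;     // Scalar iteration cost
  int scalar_outside_cost;  // Scalar outside cost
  int prologue_iters;       // prologue iterations
  int epilogue_iters;       // epilogue iterations
};

// Assumed break-even computation: smallest iteration count at which the
// vector loop is cheaper than the scalar loop.
inline int min_profitable_iters(const VectCosts &c) {
  const int denom = c.scalar_iter_cost * c.vf - c.vec_inside_cost;
  assert(denom > 0);  // otherwise vectorization never pays off
  const int outside = (c.vec_outside_cost - c.scalar_outside_cost) * c.vf;
  int n = (outside - c.vec_inside_cost * c.prologue_iters
           - c.vec_inside_cost * c.epilogue_iters) / denom;  // truncates
  // Bump by one if the vector version is still not strictly cheaper.
  if (c.scalar_iter_cost * c.vf * n <= c.vec_inside_cost * n + outside)
    ++n;
  return n;
}

// Assumed runtime threshold: at least vf iterations, minus one.
inline int runtime_threshold(const VectCosts &c) {
  return std::max(min_profitable_iters(c), c.vf) - 1;
}
```

Fed with the bdver2 numbers {4, 4, 8, 4, 0, 2, 2} this gives 2 and 3; with
the generic numbers {4, 1, 13, 1, 0, 2, 2} it gives 17 and 16, matching the
dumps.  In the bdver2 case the division is 16 / 12, which truncates to 1
before the correction step.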

Testcase:

struct ggSpectrum {
    void Set (float d)
      {
        for (int i = 0; i < 8; i++)
          data[i] = d;
      }
    float data[8];
};

void foo (ggSpectrum *s, float d)
{
  s->Set(d);
}

now the best course of action is of course to not even consider peeling
this loop for alignment ... (if it can otherwise vectorize).
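
For illustration only, a hand-written version of what vectorizing without
peeling amounts to here: two unaligned 4-float stores and no prologue or
epilogue.  This is a sketch assuming an x86 target with SSE; set_unpeeled is
a hypothetical name, the struct is a stripped-down copy of the testcase, and
this is not claimed to be the code GCC emits.

```cpp
#include <cassert>
#include <xmmintrin.h>  // SSE: _mm_set1_ps, _mm_storeu_ps

// Stripped-down copy of the testcase struct (data only).
struct ggSpectrum {
  float data[8];
};

// Hypothetical hand-written equivalent of Set(): with 8 elements and
// vf = 4 the whole loop is two vector stores, issued unaligned so that no
// alignment prologue is needed.
inline void set_unpeeled(ggSpectrum *s, float d) {
  assert(s != nullptr);
  const __m128 v = _mm_set1_ps(d);  // broadcast d to all four lanes
  _mm_storeu_ps(s->data, v);        // elements 0..3
  _mm_storeu_ps(s->data + 4, v);    // elements 4..7
}
```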

I think we run into round-off errors with my fix on bdver2; I have a crude
fix for that.
