[Bug target/101908] [12 regression] cray regression with -O2 -ftree-slp-vectorize compared to -O2

rguenth at gcc dot gnu.org via Gcc-bugs Thu, 24 Feb 2022 23:33:51 -0800

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101908


--- Comment #22 from Richard Biener <rguenth at gcc dot gnu.org> ---
(In reply to Hongtao.liu from comment #21)
> Now we have SLP node available in vector cost hook, maybe we can do sth in
> cost model to prevent vectorization when node's definition from big-size
> parameter.

Note we vectorize a load here for which we do not pass down an SLP node.
But of course there's the stmt-info one could look at - but the issue
is that for SLP that doesn't tell you which part of the variable is accessed.
Also even if we were to pass down the SLP node we do not know exactly how
it is going to vectorize - but sure, we could play with some heuristics
there.

For x86 we can just assume that all aggregates > 16 bytes are passed on the
stack, correct?  Note I see for

#include <stdlib.h>

struct X { double x[3]; };
typedef double v2df __attribute__((vector_size(16)));

v2df __attribute__((noipa))
foo (struct X x, struct X y)
{
  return (v2df) {x.x[1], x.x[2] } + (v2df) { y.x[0], y.x[1] };
}

struct X y;
int main(int argc, char **argv)
{
  struct X x = y;
  int cnt = atoi (argv[1]);
  for (int i = 0; i < cnt; ++i)
    foo (x, x);
  return 0;
}

the structs being passed as

        movups  %xmm0, 24(%rsp)
        movq    %rax, 40(%rsp)
        movq    %rax, 16(%rsp)
        movups  %xmm0, (%rsp)
        call    foo

so the alignment of the stack variable depends on the position of the
function argument (and thus on the preceding parameters).  That means
we cannot rely on &y being 16-byte aligned, and it seems we cannot
rely on a particular store sequence order here either.
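
To make the offsets concrete, here is a minimal sketch in C that models
the four caller stores from the sequence above as byte ranges relative
to %rsp at the call and checks whether each 16-byte vector load in foo
is fully covered by a single one of them.  The names (struct range,
covered_by_one_store, self_check) are made up for this illustration,
and "covered by one store" is only a simplified stand-in for when
store-to-load forwarding can succeed; real forwarding rules differ
between microarchitectures.  It also assumes the x86-64 SysV layout
seen above: x at 0(%rsp) and y at 24(%rsp), with %rsp 16-byte aligned
at the call.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct X { double x[3]; };

/* A byte range [off, off + size) relative to %rsp at the call.  */
struct range { size_t off, size; };

/* The four caller stores quoted above, in program order.  */
const struct range caller_stores[] = {
  { 24, 16 },  /* movups %xmm0, 24(%rsp) */
  { 40,  8 },  /* movq   %rax, 40(%rsp)  */
  { 16,  8 },  /* movq   %rax, 16(%rsp)  */
  {  0, 16 },  /* movups %xmm0, (%rsp)   */
};

/* Simplified model (an assumption, not a hardware spec): a load can be
   forwarded only if one single store covers all of its bytes.  */
bool
covered_by_one_store (struct range load)
{
  size_t n = sizeof caller_stores / sizeof caller_stores[0];
  for (size_t i = 0; i < n; ++i)
    {
      struct range s = caller_stores[i];
      if (s.off <= load.off && load.off + load.size <= s.off + s.size)
        return true;
    }
  return false;
}

/* Offsets of the two arguments as observed in the sequence above.  */
enum { ARG_X_OFF = 0, ARG_Y_OFF = 24 };

void
self_check (void)
{
  /* { x.x[1], x.x[2] } reads bytes [8, 24).  */
  struct range load_x = { ARG_X_OFF + offsetof (struct X, x[1]), 16 };
  /* { y.x[0], y.x[1] } reads bytes [24, 40).  */
  struct range load_y = { ARG_Y_OFF + offsetof (struct X, x[0]), 16 };

  assert (!covered_by_one_store (load_x)); /* spans movups + movq */
  assert (covered_by_one_store (load_y));  /* matches one movups  */
  assert (ARG_X_OFF % 16 == 0);            /* x is 16-byte aligned */
  assert (ARG_Y_OFF % 16 != 0);            /* y is only 8-byte aligned */
}
```

Under this model the load of { y.x[0], y.x[1] } coincides with the
movups to 24(%rsp), while the load of { x.x[1], x.x[2] } straddles the
movups to (%rsp) and the movq to 16(%rsp).  Moving either argument by
one 8-byte slot changes both the alignment and the coverage.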

That would mean pessimizing all incoming stack parameters larger than
16 bytes (maybe also those of exactly 16 bytes?) because we do not know
how the caller stored the parameters?  (Unless the caller uses %xmm
stores, all such vectorization would trigger store-to-load forwarding
(STLF) failures, depending on the load-to-store "distance" of course.)

Can you poke engineers at Intel about what a big enough "distance" would
be to make sure the store has reached L1 (and is a load from L1 better
than a failed STLF, i.e. with the store still in the buffers but not
forwardable)?
