On Mon, 6 May 2019, Jakub Jelinek wrote:

> On Fri, May 03, 2019 at 12:47:39PM +0200, Richard Biener wrote:
> > On Wed, Dec 12, 2018 at 11:54 AM Richard Biener <rguent...@suse.de> wrote:
> > >
> > >
> > > The following improves x264 vectorization by avoiding peeling for
> > > gaps, noticing that when the upper half of a vector is unused we can
> > > load only the lower part (and fill the upper half with zeros; this
> > > is what x86 does automatically, since GIMPLE doesn't allow us to
> > > leave the upper half undefined the way RTL would by using subregs).
> > >
> > > The implementation is a little bit awkward since for optimal GIMPLE
> > > code generation and costing we'd like to go the strided-load path
> > > instead.  That proves somewhat difficult, though, so the following
> > > takes the easier route but doesn't fill out the re-align paths or
> > > the masked paths (at least the fully masked path would never need
> > > peeling for gaps).
> > >
> > > Bootstrapped and tested on x86_64-unknown-linux-gnu, tested with
> > > SPEC CPU 2006 and 2017 with the expected (~4%) improvement for
> > > 625.x264_s.  Didn't see any positive or negative effects elsewhere.
> > >
> > > Queued for GCC 10.
> > 
> > Applied as r270847.
> 
> This regressed
> FAIL: gcc.target/i386/avx512vl-pr87214-1.c execution test
> (AVX512VL hw or SDE is needed to reproduce).

Looking at this.  Reproducible with SSE4.2 and

struct s { unsigned int a, b, c; };

void __attribute__ ((noipa))
foo (struct s *restrict s1, struct s *restrict s2, int n)
{
  for (int i = 0; i < n; ++i)
    {
      s1[i].b = s2[i].b;
      s1[i].c = s2[i].c;
      s2[i].c = 0;
    }
}

#define N 12

int
main ()
{
  struct s s1[N], s2[N];
  for (unsigned int j = 0; j < N; ++j)
    {
      s2[j].a = j * 5;
      s2[j].b = j * 5 + 2;
      s2[j].c = j * 5 + 4;
    }
  foo (s1, s2, N);
  for (unsigned int j = 0; j < N; ++j)
    if (s1[j].b != j * 5 + 2)
      __builtin_abort ();
  return 0;
}

Probably the cause of PR90358

Richard.
