On Mon, 6 May 2019, Jakub Jelinek wrote:

> On Fri, May 03, 2019 at 12:47:39PM +0200, Richard Biener wrote:
> > On Wed, Dec 12, 2018 at 11:54 AM Richard Biener <rguent...@suse.de> wrote:
> > >
> > > The following improves x264 vectorization by avoiding peeling for
> > > gaps, noticing that when the upper half of a vector is unused we
> > > can load the lower part only (and fill the upper half with zeros -
> > > this is what x86 does automatically; GIMPLE doesn't allow us to
> > > leave the upper half undefined as RTL would with subregs).
> > >
> > > The implementation is a little bit awkward because for optimal
> > > GIMPLE code generation and costing we'd like to go down the
> > > strided-load path instead.  That proves somewhat difficult, though,
> > > so the following easier approach doesn't fill out the re-align
> > > paths or the masked paths (at least the fully masked path would
> > > never need peeling for gaps).
> > >
> > > Bootstrapped and tested on x86_64-unknown-linux-gnu, tested with
> > > SPEC CPU 2006 and 2017 with the expected (~4%) improvement for
> > > 625.x264_s.  Didn't see any positive or negative effects elsewhere.
> > >
> > > Queued for GCC 10.
> >
> > Applied as r270847.
>
> This regressed
>   FAIL: gcc.target/i386/avx512vl-pr87214-1.c execution test
> (AVX512VL hw or SDE is needed to reproduce).
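For context, the zero-fill the quoted mail refers to is what a plain
movq load does on x86: it loads 64 bits into the low half of an xmm
register and zeroes the upper half, so reading only the lower part of
a vector needs no memory access beyond the loaded element.  A minimal
sketch using SSE2 intrinsics (my illustration, not code from the
patch):

  #include <emmintrin.h>

  /* Load the low 64 bits of a 128-bit vector.  Only the 8 bytes at p
     are read; the hardware zero-fills the upper half of the result.  */
  __m128i
  load_low_half (const long long *p)
  {
    return _mm_loadl_epi64 ((const __m128i *) p);
  }

This is the property that lets the vectorizer skip peeling for gaps
when the upper half of the vector would be unused anyway.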
Looking at this.  Reproducible with SSE4.2 and the following testcase:

struct s { unsigned int a, b, c; };

/* Field a is never read here, so the vectorized loads from s2
   form a group with a gap.  */
void __attribute__ ((noipa))
foo (struct s *restrict s1, struct s *restrict s2, int n)
{
  for (int i = 0; i < n; ++i)
    {
      s1[i].b = s2[i].b;
      s1[i].c = s2[i].c;
      s2[i].c = 0;
    }
}

#define N 12

int
main ()
{
  struct s s1[N], s2[N];

  for (unsigned int j = 0; j < N; ++j)
    {
      s2[j].a = j * 5;
      s2[j].b = j * 5 + 2;
      s2[j].c = j * 5 + 4;
    }

  foo (s1, s2, N);

  /* Verify the copied b values.  */
  for (unsigned int j = 0; j < N; ++j)
    if (s1[j].b != j * 5 + 2)
      __builtin_abort ();

  return 0;
}

Probably the cause of PR90358.
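To reproduce, assuming the testcase is saved as t.c (the file name and
exact options are my guess; the mail only notes SSE4.2 is enough):

  gcc -O3 -msse4.2 t.c -o t && ./t

The run should abort in the check on s1[j].b when the miscompilation
triggers.

Richard.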