On Fri, May 03, 2019 at 12:47:39PM +0200, Richard Biener wrote:
> On Wed, Dec 12, 2018 at 11:54 AM Richard Biener <rguent...@suse.de> wrote:
> >
> >
> > The following improves x264 vectorization by avoiding peeling for gaps
> > noticing that when the upper half of a vector is unused we can
> > load the lower part only (and fill the upper half with zeros - this
> > is what x86 does automatically, GIMPLE doesn't allow us to leave
> > the upper half undefined as RTL would with using subregs).
> >
> > The implementation is a little bit awkward as for optimal GIMPLE
> > code-generation and costing we'd like to go the strided load path
> > instead. That proves somewhat difficult though thus the following
> > is easier but doesn't fill out the re-align paths nor the masked
> > paths (at least the fully masked path would never need peeling for
> > gaps).
> >
> > Bootstrapped and tested on x86_64-unknown-linux-gnu, tested with
> > SPEC CPU 2006 and 2017 with the expected (~4%) improvement for
> > 625.x264_s. Didn't see any positive or negative effects elsewhere.
> >
> > Queued for GCC 10.
>
> Applied as r270847.

This regressed
FAIL: gcc.target/i386/avx512vl-pr87214-1.c execution test
(AVX512VL hw or SDE is needed to reproduce).

> > 2018-12-12  Richard Biener  <rguent...@suse.de>
> >
> >         * tree-vect-stmts.c (get_group_load_store_type): Avoid
> >         peeling for gaps by loading only lower halves of vectors
> >         if possible.
> >         (vectorizable_load): Likewise.
> >
> >         * gcc.dg/vect/slp-reduc-sad-2.c: New testcase.

	Jakub