On Tue, Apr 10, 2018 at 12:40 PM, Richard Sandiford
<richard.sandif...@linaro.org> wrote:
> Jakub Jelinek <ja...@redhat.com> writes:
>> On Mon, Apr 09, 2018 at 06:47:45PM +0100, Richard Sandiford wrote:
>>> In this PR we used WIDEN_SUM_EXPR to vectorise:
>>>
>>>   short i, y;
>>>   int sum;
>>>   [...]
>>>   for (i = x; i > 0; i--)
>>>     sum += y;
>>>
>>> with 4 ints and 8 shorts per vector.  The problem was that we set
>>> the VF based only on the ints, then calculated the number of vector
>>> copies based on the shorts, giving 4/8.  Previously that led to
>>> ncopies==0, but after r249897 we pick it up as an ICE.
>>>
>>> In this particular case we could vectorise the reduction by setting
>>> ncopies based on the output type rather than the input type, but it
>>> doesn't seem worth adding a special "optimisation" for such a
>>> pathological case.  I think it's really an instance of the more general
>>> problem that we can't vectorise using combinations of (say) 64-bit and
>>> 128-bit vectors on targets that support both.
>>
>> We badly need that, there are plenty of PRs where we generate really large
>> vectorized loop because of it e.g. on x86 where we can easily use 128-bit,
>> 256-bit and 512-bit vectors; but I'm afraid it is not a stage4 material.
>
> Yeah.  We also need it on AArch64 for a proper implementation of simd
> clones for Advanced SIMD.
>
> I think it's related to one of the most important missed optimisations
> for SVE: when using mixed data sizes, it's usually better to store the
> smaller data unpacked in wider lanes, and there's direct support for
> loading and storing it that way.  In both the SVE and non-SVE cases,
> we want the VF sometimes to be based on wider sizes rather than the
> narrowest one.

It's unfortunately not very easy to remove the limitation in full and
in general it widens the space we need to search for the best
vectorization even further...

> FWIW, I have some patches queued for GCC 9 that should make it
> easier to implement this (but no promises).  They're also supposed
> to make it possible to compare the costs of different implementations
> side-by-side, rather than always picking the first one that has
> a lower cost than the scalar code.

I have also a similar patch in the works.

Richard.

> Thanks,
> Richard
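
For readers following the arithmetic in the quoted PR description, the
sketch below reproduces the "4/8" calculation in isolation.  It is only
an illustration, not GCC code: the helper names lanes() and
ncopies_for(), the 16-byte vector size, and the 4-byte int / 2-byte
short element sizes are assumptions chosen to match the "4 ints and 8
shorts per vector" case.

```c
/* Illustrative sketch only; these helpers are hypothetical and are not
   part of the GCC vectorizer.  */

/* Number of elements of ELEM_BYTES bytes that fit in one vector of
   VECTOR_BYTES bytes.  */
static unsigned lanes(unsigned vector_bytes, unsigned elem_bytes)
{
    return vector_bytes / elem_bytes;
}

/* Number of vector copies needed to cover VF scalar iterations when
   each vector holds NUNITS elements.  Integer division truncates, so
   the result is 0 whenever NUNITS > VF.  */
static unsigned ncopies_for(unsigned vf, unsigned nunits)
{
    return vf / nunits;
}

/* The case from the PR, assuming 128-bit vectors: the VF is taken from
   the int lanes, and ncopies is then derived from either the short
   input type or the int output type.  */
static unsigned pr_vf(void)
{
    return lanes(16, 4);                      /* 4 ints per vector */
}

static unsigned pr_ncopies_from_input(void)
{
    return ncopies_for(pr_vf(), lanes(16, 2)); /* 4 / 8 == 0 */
}

static unsigned pr_ncopies_from_output(void)
{
    return ncopies_for(pr_vf(), lanes(16, 4)); /* 4 / 4 == 1 */
}
```

With these assumptions, pr_ncopies_from_input() yields 0, which is the
degenerate value described above, while pr_ncopies_from_output() yields
1, corresponding to the suggestion of basing ncopies on the output type.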