On 12 June 2012 11:46, Richard Guenther <richard.guent...@gmail.com> wrote:
> On Tue, Jun 12, 2012 at 11:22 AM, Julian Brown <jul...@codesourcery.com> wrote:
>> On Mon, 11 Jun 2012 16:46:27 +0100
>> Ramana Radhakrishnan <ramana.radhakrish...@linaro.org> wrote:
>>
>>> Hi,
>>>
>>> I don't like the ML bits of the patch as it stands today and before
>>> committing I would like to clean up the ML bits quite a bit further
>>> especially in areas where I've put FIXMEs [...]
>>
>> I had a go at this, see attached. Untested. Note there are some
>> semantic differences in output:
>>
>> vzipq_p8 (poly8x16_t __a, poly8x16_t __b)
>> {
>>   poly8x16x2_t __rv;
>> -  uint8x16_t __mask1 = {0, 2};
>> -  uint8x16_t __mask2 = {1, 3};
>> -  __rv.val[0] = (poly8x16_t)__builtin_shuffle (__a, __b, __mask1);
>> -  __rv.val[1] = (poly8x16_t)__builtin_shuffle (__a, __b, __mask2);
>> +  uint8x16_t __mask1 = { 0, 16, 1, 17, 2, 18, 3, 19, 4, 20, 5, 21, 6, 22, 7, 23 };
>> +  uint8x16_t __mask2 = { 8, 24, 9, 25, 10, 26, 11, 27, 12, 28, 13, 29, 14, 30, 15, 31 };
>> +  __rv.val[0] = (poly8x16_t) __builtin_shuffle (__a, __b, __mask1);
>> +  __rv.val[1] = (poly8x16_t) __builtin_shuffle (__a, __b, __mask2);
>
> You should get better code at -O0 when not using a temporary __mask1/__mask2
> but directly pasting the constant in the builtin call.

I tried that yesterday, but it didn't seem to help. From a quick peek at
the dumps, it looks like we could do with some limited constant
propagation just for the vec_perm expand cases:

  D.14032 = { 0, 8, 1, 9, 2, 10, 3, 11 };
  D.14044 = VEC_PERM_EXPR <__a, __b, D.14032>;

That's what I see from the dumps. From a quick skim of the sources, my
suspicion is that lower_vec_perm in tree-vect-generic.c is where we could
try doing a limited constant propagation in this case. Is that where one
should attempt to fix this?

Consider the following testcase, rewritten with just __builtin_shuffle so
that you can also see it on other platforms that have vec_perm_const
defined for the interleave-style operations. Compare what you get at -O0
with what you get at -O1. On arm-linux-gnueabi with -mfpu=neon
-mfloat-abi=softfp -mcpu=cortex-a9, at -O0 you'd see it use the generic
permute operations, and at -O1 you'd see a vzip.32 instruction.

typedef int v4si __attribute__ ((vector_size (16)));

v4si
vs (v4si a, v4si b)
{
  return __builtin_shuffle (a, b, (v4si) {0, 4, 1, 5});
}

regards
Ramana

>
>>   return __rv;
>> }
>>
>> I wasn't quite sure which version was correct -- but your version
>> doesn't seem to have enough elements for these cases?
>>
>> HTH,
>>
>> Julian
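
The mask question raised in the quoted diff can be checked without a NEON
target. The following is a minimal sketch in portable C, not code from the
patch: it models a two-operand shuffle on byte arrays, assuming the
documented GCC behaviour that mask values are taken modulo 2*N, with values
below N selecting from the first operand and the rest from the second. The
names shuffle2_model and zip_masks are hypothetical helpers introduced only
for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Model of a two-operand shuffle on n-lane byte vectors (an assumption
   based on the documented __builtin_shuffle semantics): each mask value
   is reduced modulo 2*n; values below n pick a lane from a, the
   remaining values pick a lane from b.  */
void shuffle2_model(const uint8_t *a, const uint8_t *b,
                    const uint8_t *mask, uint8_t *out, size_t n)
{
    assert(n > 0);
    for (size_t i = 0; i < n; i++) {
        size_t sel = mask[i] % (2 * n);
        out[i] = sel < n ? a[sel] : b[sel - n];
    }
}

/* Fill lo and hi with the interleave (zip) masks for n lanes, n even:
   lo = {0, n, 1, n+1, ...} and hi = {n/2, n + n/2, n/2 + 1, ...}.  */
void zip_masks(uint8_t *lo, uint8_t *hi, size_t n)
{
    assert(n > 0 && n % 2 == 0 && 2 * n <= 256);
    for (size_t i = 0; i < n / 2; i++) {
        lo[2 * i]     = (uint8_t) i;
        lo[2 * i + 1] = (uint8_t) (n + i);
        hi[2 * i]     = (uint8_t) (n / 2 + i);
        hi[2 * i + 1] = (uint8_t) (n + n / 2 + i);
    }
}
```

Calling zip_masks with n = 16 reproduces the two 16-element masks in the
"+" lines of the diff, n = 8 gives the { 0, 8, 1, 9, ... } constant seen in
the dump, and n = 4 gives the {0, 4, 1, 5} mask used in the testcase. Under
the same model, a 16-lane mask initialised with only two values has its
remaining lanes zero-filled, so most output lanes would select element 0 of
the first operand, which is consistent with the concern about there not
being enough elements.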