On 13 August 2012 14:21, Marc Glisse <marc.gli...@inria.fr> wrote:
> On Mon, 13 Aug 2012, Richard Guenther wrote:
>
>> On Mon, Aug 13, 2012 at 3:12 PM, Ramana Radhakrishnan
>> <ramana.radhakrish...@linaro.org> wrote:
>>>>
>>>> I guess people will complain soon enough if this causes horrible
>>>> performance regressions in vectorized code.
>>>
>>> Not having looked at your patch in great detail, surely what we don't
>>> want is a situation where two constant permutations are converted into
>>> one generic permute. Based on a quick read of your patch I couldn't
>>> work out whether that happens. It might be that two constant permutes
>>> are cheaper than one generic permute. Have you looked at any examples
>>> in that space? I surely wouldn't like to see a sequence of interleave /
>>> transpose change into a generic permute operation on Neon, as that
>>> would be far more expensive. It surely needs more testing than just
>>> this before going in, the reason being that the generic permute would
>>> likely take more registers and indeed produce loads from a constant
>>> pool for the new mask.
>
> What do you call constant / non-constant? The combined permutation is
> still constant, although the expansion (in the back-end) might fail to
> expand it efficiently and fall back to the generic permutation
> expansion...
That is exactly what I was worried about. By constant I would expect
something that is expanded as a constant permute by the backend - an
interleave operation, a transpose operation, or indeed some of the
funkier operations below in ARM / Neon, which is the SIMD architecture
I'm most familiar with. If you had something that did the following:

#include <arm_neon.h>

uint8x8_t
tst_vrev642_u8 (uint8x8_t __a)
{
  uint8x8_t __mask1 = { 7, 6, 5, 4, 3, 2, 1, 0 };
  uint8x8_t __mask2 = { 0, 8, 2, 10, 4, 12, 6, 14 };
  return __builtin_shuffle (__builtin_shuffle (__a, __mask1), __mask2);
}

I would expect these instructions:

	vrev64.8	d0, d0
	vmov	d16, d0		@ v8qi
	vtrn.8	d0, d16
	bx	lr

With your patch I see:

tst_vrev642_u8:
	@ args = 0, pretend = 0, frame = 0
	@ frame_needed = 0, uses_anonymous_args = 0
	@ link register save eliminated.
	fldd	d16, .L2
	vtbl.8	d0, {d0}, d16
	bx	lr
.L3:
	.align	3
.L2:
	.byte	7
	.byte	7
	.byte	5
	.byte	5
	.byte	3
	.byte	3
	.byte	1
	.byte	1

It might be that the backend predicates need tightening in this case,
but why not try to get cost information about such combinations rather
than just doing this gratuitously? The ARM port might be wrong here,
but in general, when the vector permute rewrites were done, we chose to
keep the generic constant permute in the backend as a last resort to
fall back to. However, if fwprop starts making such transformations,
one really ought to compare the relative costs of these operations
rather than allowing gratuitous replacements.

Thanks,
Ramana
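
P.S. For anyone following along: the mask loaded from .L2 above is just
the composition of the two constant masks. Below is a minimal standalone
sketch of the composition rule - hypothetical code, not from the patch;
it assumes the documented behaviour that single-input __builtin_shuffle
indices are taken modulo the number of vector elements:

/* Hypothetical standalone sketch, not part of the patch: derives the
   combined mask that ends up in the literal pool above, assuming the
   single-input __builtin_shuffle takes its indices modulo the number
   of vector elements.  */
#include <stdio.h>

#define N 8

int
main (void)
{
  const unsigned char mask1[N] = { 7, 6, 5, 4, 3, 2, 1, 0 };	/* vrev64.8 */
  const unsigned char mask2[N] = { 0, 8, 2, 10, 4, 12, 6, 14 };
  unsigned char combined[N];

  /* shuffle (shuffle (a, mask1), mask2)[i] == a[mask1[mask2[i] % N]].  */
  for (int i = 0; i < N; i++)
    combined[i] = mask1[mask2[i] % N];

  /* Prints 7 7 5 5 3 3 1 1 -- the .byte values at .L2 above.  */
  for (int i = 0; i < N; i++)
    printf ("%d ", combined[i]);
  printf ("\n");
  return 0;
}

(The modulo step matters: __mask2's out-of-range entries 8, 10, 12 and
14 wrap back onto the even lanes 0, 2, 4 and 6.)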