On Wed, Aug 15, 2012 at 1:56 PM, Ramana Radhakrishnan <ramana.radhakrish...@linaro.org> wrote:
> [It looks like I missed hitting the send button on this response]
>
>>
>> Seems to be one instruction shorter at least ;-) Yes, there can be much
>> worse regressions than that because of the patch (like 40 instructions
>> instead of 4, in the x86 backend).
>
> If this is replacing 4 instructions with 40 in the x86 backend, maybe
> someone will notice :)
>
> Not a win in this particular testcase, because the compiler replaces 2
> constant permutes (that's about 4 cycles) with a load from the
> constant pool and a generic permute, and in addition pollutes the
> icache with guff in the constant pool. If you fold 3-4 permutes
> into a single one then it might be a win, but not until that point.
>
>
>> with a-b without first asking the backend whether it might be more
>> efficient. One permutation is better than 2.
>> It just happens that the range
>> of possible permutations is too large (and the basic instructions are too
>> strange) for backends to do a good job on them, and thus keeping toplevel
>> input as a hint is important.
>
> Of course, the problem here is this change of semantics with the hook
> TARGET_VEC_PERM_CONST_OK. Targets were expanding to generic permutes
> with constants in the *absence* of being able to deal with them with
> the specialized permutes. fwprop will now leave us at a point where
> each target now has to grow more knowledge of how best to
> expand a generic constant permute into a sequence of permutes rather
> than just using the generic permute.
>
> Generating a sequence of permutes from a single constant permute will
> be a harder problem than (say) dealing with immediate expansions, so
> you are pushing more complexity into the target. In the short term,
> target maintainers should definitely have a heads-up that folks using
> vector permute intrinsics could actually see performance regressions
> on their targets.
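For readers following along: the folding being argued about (two constant permutes collapsed into one) amounts to composing the two masks. A minimal sketch in plain C, assuming single-input permutes over 4 lanes with in-range mask values; `apply_perm` and `compose_perm` are hypothetical helpers that model the result[i] = v[mask[i]] semantics of GCC's `__builtin_shuffle`, not GCC internals:

```c
#include <assert.h>
#include <stddef.h>

/* Lane count assumed for this sketch. */
#define NLANES 4

/* Hypothetical helper: single-input constant permute, out[i] = in[mask[i]]. */
static void apply_perm(const int in[NLANES], const unsigned mask[NLANES],
                       int out[NLANES])
{
    for (size_t i = 0; i < NLANES; i++)
        out[i] = in[mask[i]];
}

/* Hypothetical helper: fold shuffle (shuffle (v, first), second) into
   shuffle (v, combined).  Lane i of the result reads lane second[i] of the
   intermediate vector, which itself came from lane first[second[i]] of v. */
static void compose_perm(const unsigned first[NLANES],
                         const unsigned second[NLANES],
                         unsigned combined[NLANES])
{
    for (size_t i = 0; i < NLANES; i++) {
        assert(second[i] < NLANES); /* in-range masks only in this sketch */
        combined[i] = first[second[i]];
    }
}
```

With first = {1, 0, 3, 2} and second = {2, 3, 0, 1}, the combined mask is {3, 2, 1, 0}: a full reversal that a given target may or may not support directly, which is exactly the cost question raised above.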
It's of course the same when the user input contains such a constant permute that the target handles non-optimally. So I'm less convinced that this puts too much burden on the target side.

OTOH, if there is a generic kind of shuffle that targets do not implement directly but that can be simulated by two directly implemented ones, pushing that logic into the expander (and adjusting the target hook semantics) would be OK.

Richard.

> Thanks,
> Ramana
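To make the expander-side idea concrete, here is a rough sketch, assuming 4-lane single-input permutes and a fixed table standing in for the target's directly supported shuffles. The table contents and `split_perm` are hypothetical; a real expander would query the backend (the const-permute hook) rather than scan a list. The search returns two supported masks whose composition equals the wanted one.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NLANES 4

/* Hypothetical stand-in for the permutes a target implements directly. */
static const unsigned supported[][NLANES] = {
    {0, 1, 2, 3}, /* identity: lets a single supported permute match too */
    {1, 0, 3, 2}, /* swap adjacent pairs */
    {2, 3, 0, 1}, /* swap halves */
    {0, 0, 2, 2}, /* duplicate even lanes */
};
#define NSUPPORTED (sizeof supported / sizeof supported[0])

/* Hypothetical helper: find indices a, b such that applying supported[a]
   and then supported[b] yields `wanted`, i.e.
   wanted[i] == supported[a][supported[b][i]] for every lane i.
   Returns false if no pair of supported permutes simulates `wanted`. */
static bool split_perm(const unsigned wanted[NLANES],
                       size_t *first, size_t *second)
{
    for (size_t i = 0; i < NLANES; i++)
        assert(wanted[i] < NLANES); /* in-range masks only in this sketch */

    for (size_t a = 0; a < NSUPPORTED; a++)
        for (size_t b = 0; b < NSUPPORTED; b++) {
            bool match = true;
            for (size_t i = 0; i < NLANES && match; i++)
                match = supported[a][supported[b][i]] == wanted[i];
            if (match) {
                *first = a;
                *second = b;
                return true;
            }
        }
    return false;
}
```

For the reversal {3, 2, 1, 0}, this finds "swap adjacent pairs" followed by "swap halves"; a broadcast such as {3, 3, 3, 3} cannot be built from this table and would fall back to a generic permute. A pair that includes the identity stands for a single permute, which a real expander would emit on its own.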