https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117173

--- Comment #2 from Robin Dapp <rdapp at gcc dot gnu.org> ---
In x264, before the optimization we have:

_42 = VEC_PERM_EXPR <vect__49.83_41, vect__52.84_40, { 0, 1, 10, 11, 4, 5, 14, 15 }>;
...
_44 = VEC_PERM_EXPR <vect_t0_114.85_43, vect_t0_114.85_43, { 1, 3, 1, 3, 5, 7, 5, 7 }>;
_45 = VEC_PERM_EXPR <vect_t0_114.85_43, vect_t0_114.85_43, { 0, 2, 0, 2, 4, 6, 4, 6 }>;

The first one (_42) is "monotonic" and can be implemented by a vmerge.  This
amounts to one mask load and one instruction.

_44 and _45 can be implemented by one vrgather each because they have a single
source.
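The classification above can be sketched as follows.  This is a hypothetical helper (not GCC's actual permute-lowering code) that just restates the cost reasoning: a selector that only touches one source maps to a single vrgather, a two-source selector where lane i always comes from lane i of either source maps to a vmerge, and anything else needs the general two-vrgather strategy.

```python
def classify_perm(sel, n):
    """Classify a two-source VEC_PERM_EXPR selector with n lanes per
    source.  Hypothetical sketch of the cost reasoning, not GCC code."""
    if all(s < n for s in sel) or all(s >= n for s in sel):
        return "single vrgather"   # only one source is referenced
    if all(s % n == i for i, s in enumerate(sel)):
        return "vmerge"            # lane i always comes from lane i
    return "two vrgathers"         # general two-source shuffle

# Selectors from the dumps above (n = 8 lanes per source):
print(classify_perm([0, 1, 10, 11, 4, 5, 14, 15], 8))  # _42:  vmerge
print(classify_perm([1, 3, 1, 3, 5, 7, 5, 7], 8))      # _44:  single vrgather
print(classify_perm([1, 11, 1, 11, 5, 15, 5, 15], 8))  # _838: two vrgathers
```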


After the optimization we have:

_838 = VEC_PERM_EXPR <vect__49.83_41, vect__52.84_40, { 1, 11, 1, 11, 5, 15, 5, 15 }>;
_846 = VEC_PERM_EXPR <vect__49.83_41, vect__52.84_40, { 0, 10, 0, 10, 4, 14, 4, 14 }>;

Both of those have two sources and generally require two vrgathers each (one of
them possibly masked), and we need to arrange the indices properly.
I don't think our current implementation of this generic approach is ideal, but
even an ideal one would never be as cheap as the two non-merged permutes.
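The generic strategy can be simulated like this (a sketch only, not the actual RVV expansion): gather from the first source with the indices wrapped into range, gather from the second source with the indices rebased, then merge under the mask of which lanes reference the second source.

```python
def two_source_permute(src1, src2, sel):
    """Simulate the generic two-source permute: two gathers plus a
    masked merge.  Sketch of the strategy described above, not GCC's
    actual expansion."""
    n = len(src1)
    g1 = [src1[s % n] for s in sel]                    # vrgather from src1
    g2 = [src2[s - n] if s >= n else 0 for s in sel]   # vrgather from src2 (masked)
    mask = [s >= n for s in sel]                       # lanes taken from src2
    return [b if m else a for a, b, m in zip(g1, g2, mask)]

# Stand-in values for vect__49.83_41 and vect__52.84_40:
src1 = list(range(10, 18))
src2 = list(range(20, 28))
print(two_source_permute(src1, src2, [1, 11, 1, 11, 5, 15, 5, 15]))
# [11, 23, 11, 23, 15, 27, 15, 27]
```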

Of course we could try deconstructing the index to arrive at the "before"
but... :)
