https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98856

--- Comment #8 from Richard Biener <rguenth at gcc dot gnu.org> ---
Exploring more options, I noticed there's no arithmetic right shift for V2DI,
so vectorizing

  uint64_t carry = (uint64_t)(((int64_t)W[1]) >> 63) & (uint64_t)135;
  W[1] = (W[1] << 1) ^ ((uint64_t)(((int64_t)W[0]) >> 63) & (uint64_t)1);
  W[0] = (W[0] << 1) ^ carry;

didn't work out.  But V2DI >> CST with CST > 31 can be implemented by
shifting with VPSRAD (by CST - 32), shuffling the shifted high parts into
the low positions and then sign-extending them with PMOVSXDQ.
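
To illustrate why that sequence gives the right answer, here is a scalar
model of a single 64-bit lane.  It is only a sketch: the function name
sra64_via_sra32 is made up for this example, and it assumes GCC's behaviour
that >> on a negative signed value is an arithmetic shift and that the
uint32_t to int32_t conversion wraps.

```c
#include <stdint.h>

/* Hypothetical helper: models one V2DI lane of "x >> cst" for
   32 <= cst <= 63 using only a 32-bit arithmetic shift plus a
   sign extension.  Assumes GCC semantics for signed >> and for
   the unsigned-to-signed conversion. */
int64_t sra64_via_sra32(int64_t x, unsigned cst)
{
  /* The high dword of the lane; the low dword cannot affect the
     result once cst >= 32. */
  int32_t hi = (int32_t)(uint32_t)((uint64_t)x >> 32);

  /* VPSRAD step: 32-bit arithmetic shift by the remaining amount. */
  int32_t shifted = hi >> (cst - 32);

  /* Shuffle + PMOVSXDQ step: the shifted high dword lands in the low
     position and is sign-extended back to 64 bits. */
  return (int64_t)shifted;
}
```

For example, sra64_via_sra32(x, 63) yields -1 for any negative x and 0
otherwise, which is the mask the kernel above needs.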

Maybe there's even something more clever for the special case of >> 63.
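
One possibility for >> 63, offered as an untested-in-GCC sketch rather than
what the vectorizer does: since the result is just the sign bit replicated
across the lane, a 32-bit arithmetic shift by 31 followed by a dword shuffle
that copies each high dword over its low neighbour needs only SSE2 and no
PMOVSXDQ.  The function name sra63_v2di is made up for this example, and the
scalar fallback exists only so that the snippet also builds where SSE2 is not
available.

```c
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Hypothetical helper: out[i] = in[i] >> 63 (arithmetic) for both
   64-bit lanes, i.e. all-ones for negative lanes and zero otherwise. */
void sra63_v2di(const int64_t in[2], int64_t out[2])
{
#if defined(__SSE2__)
  __m128i v = _mm_loadu_si128((const __m128i *)in);
  /* PSRAD by 31: every dword becomes 0 or -1 according to its own sign. */
  __m128i s = _mm_srai_epi32(v, 31);
  /* PSHUFD: replicate dwords 1 and 3 (the high halves) over each lane. */
  __m128i r = _mm_shuffle_epi32(s, _MM_SHUFFLE(3, 3, 1, 1));
  _mm_storeu_si128((__m128i *)out, r);
#else
  /* Portable fallback: negate the extracted sign bit. */
  for (int i = 0; i < 2; i++)
    out[i] = -(int64_t)((uint64_t)in[i] >> 63);
#endif
}
```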

As said, I'm just checking whether "optimal" vectorization of the kernel
would solve the issue.  But I guess pipelines are wide enough that the
original scalar code effectively executes "vectorized" anyway.
