https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98856
--- Comment #8 from Richard Biener <rguenth at gcc dot gnu.org> ---
Exploring more options I noticed there's no arithmetic vector V2DI right shift,
so vectorizing

    uint64_t carry = (uint64_t)(((int64_t)W[1]) >> 63) & (uint64_t)135;
    W[1] = (W[1] << 1) ^ ((uint64_t)(((int64_t)W[0]) >> 63) & (uint64_t)1);
    W[0] = (W[0] << 1) ^ carry;

didn't work out.  But V2DI >> CST with CST > 31 can be implemented with
VPSRAD and then doing PMOVSXDQ after shuffling the high shifted part into
low position.  Maybe there's something more clever for the special case of
>> 63 even.

As said, I'm just trying to see whether "optimal" vectorization of the kernel
would solve the issue.  But I guess pipelines are wide enough that the original
scalar code effectively executes "vectorized".
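
To make the VPSRAD + PMOVSXDQ idea concrete, here is a rough scalar C model of a single 64-bit lane. It is not vector code and not what GCC emits. The names sra64_via_32, step_ref and step_emulated are made up for this sketch. It assumes that >> on a negative signed integer is an arithmetic shift and that converting an out-of-range unsigned value to a signed type wraps, both of which GCC documents as its implementation-defined behavior.

```c
#include <stdint.h>

/* Model of one V2DI lane: arithmetic shift right by n, for 32 <= n <= 63.
   Only the high dword matters for such shift counts. */
static uint64_t sra64_via_32(uint64_t x, unsigned n)
{
    /* Take the high 32 bits of the lane (the "shuffle high part into low
       position" step in the comment above). */
    int32_t hi = (int32_t)(uint32_t)(x >> 32);

    /* 32-bit arithmetic shift, which is what VPSRAD provides per dword.
       Assumption: >> on negative int32_t is arithmetic (true for GCC). */
    int32_t shifted = hi >> (n - 32);

    /* Sign-extend 32 -> 64 bits, which is what PMOVSXDQ does per element. */
    return (uint64_t)(int64_t)shifted;
}

/* The kernel step exactly as quoted in the comment. */
static void step_ref(uint64_t W[2])
{
    uint64_t carry = (uint64_t)(((int64_t)W[1]) >> 63) & (uint64_t)135;
    W[1] = (W[1] << 1) ^ ((uint64_t)(((int64_t)W[0]) >> 63) & (uint64_t)1);
    W[0] = (W[0] << 1) ^ carry;
}

/* The same step with both >> 63 shifts routed through the 32-bit model. */
static void step_emulated(uint64_t W[2])
{
    uint64_t m1 = sra64_via_32(W[1], 63);  /* all-ones if top bit of W[1] is set */
    uint64_t m0 = sra64_via_32(W[0], 63);  /* all-ones if top bit of W[0] is set */
    uint64_t carry = m1 & (uint64_t)135;
    W[1] = (W[1] << 1) ^ (m0 & (uint64_t)1);
    W[0] = (W[0] << 1) ^ carry;
}
```

For >> 63 the 32-bit shift count is 31, so each lane's result is simply its sign bit broadcast across all 64 bits. That is why a cheaper sequence for this special case seems plausible.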