http://gcc.gnu.org/bugzilla/show_bug.cgi?id=57233
            Bug ID: 57233
           Summary: Vector lowering of LROTATE_EXPR pessimizes code
           Product: gcc
           Version: 4.9.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: glisse at gcc dot gnu.org

Hello,

the vector lowering pass, when it sees a rotate on a vector that is not a
supported operation, lowers it to scalar rotates. However, from a quick look at
the RTL expanders (untested), they know how to handle a vector rotate as long
as shifts and ior are supported, and that would yield better code than the
scalar ops. So I think the vector lowering pass should not just check if rotate
is supported, but also if shift and ior are, before splitting the operation.

typedef unsigned vec __attribute__((vector_size(4*sizeof(int))));
vec f(vec a){
  return (a<<2)|(a>>30);
}

without rotate:
    vpsrld    $30, %xmm0, %xmm1
    vpslld    $2, %xmm0, %xmm0
    vpor    %xmm0, %xmm1, %xmm0

with a patch that recognizes rotate for vectors:
    vpextrd    $2, %xmm0, %edx
    vmovd    %xmm0, %eax
    rorx    $30, %eax, %eax
    movl    %eax, -16(%rsp)
    rorx    $30, %edx, %ecx
    vpextrd    $1, %xmm0, %eax
    movl    %ecx, -12(%rsp)
    vmovd    -16(%rsp), %xmm3
    vpextrd    $3, %xmm0, %edx
    vmovd    -12(%rsp), %xmm2
    rorx    $30, %eax, %eax
    rorx    $30, %edx, %edx
    vpinsrd    $1, %eax, %xmm3, %xmm1
    vpinsrd    $1, %edx, %xmm2, %xmm0
    vpunpcklqdq    %xmm0, %xmm1, %xmm0

(I am not sure all those ext/ins are optimal, I would have expected one mov
from xmm0 to memory, then the scalar rotates are done and write to memory
again, and one final mov back to the FPU, but my intuition may be wrong)
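
For reference, here is a small sketch (not part of any patch) showing that the
shift/ior form and the per-element scalar form compute the same rotate, which
is why expanding the vector rotate as two shifts and an ior would be a valid
alternative to splitting it. It assumes GCC's vector extensions and a 32-bit
unsigned int; the names rotl2_vector, rotl2_scalar and rotate_forms_agree are
made up for illustration.

```c
#include <stdint.h>

/* Same vector type as in the testcase: four 32-bit unsigned lanes
   (assumes sizeof(int) == 4). */
typedef unsigned vec __attribute__((vector_size(4 * sizeof(int))));

/* Vector form: rotate every lane left by 2 with two shifts and an ior. */
vec rotl2_vector(vec a)
{
    return (a << 2) | (a >> 30);
}

/* Scalar form: roughly what splitting the operation amounts to,
   one rotate per lane. */
vec rotl2_scalar(vec a)
{
    vec r;
    for (int i = 0; i < 4; i++) {
        uint32_t x = a[i];
        r[i] = (x << 2) | (x >> 30);
    }
    return r;
}

/* Illustrative check: returns 1 if both forms agree on a sample input. */
int rotate_forms_agree(void)
{
    vec in = { 0x80000001u, 0x12345678u, 0xffffffffu, 0u };
    vec v = rotl2_vector(in);
    vec s = rotl2_scalar(in);
    for (int i = 0; i < 4; i++)
        if (v[i] != s[i])
            return 0;
    return 1;
}
```

Comparing the assembly generated for the two functions (for instance with
-O2 -mavx2 -mbmi2) should make the cost difference between the two
expansions visible; the exact output will depend on the compiler version.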