https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112992
Bug ID: 112992
Summary: Inefficient vector initialization using vec_duplicate/broadcast
Product: gcc
Version: unknown
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: roger at nextmovesoftware dot com
Target Milestone: ---

The following four functions should in theory all produce the same code:

    typedef unsigned long long v4di __attribute((vector_size(32)));
    typedef unsigned int v8si __attribute((vector_size(32)));
    typedef unsigned short v16hi __attribute((vector_size(32)));
    typedef unsigned char v32qi __attribute((vector_size(32)));

    #define MASK 0x01010101
    #define MASKL 0x0101010101010101ULL
    #define MASKS 0x0101

    v4di fooq() { return (v4di){MASKL,MASKL,MASKL,MASKL}; }

    v8si food() { return (v8si){MASK,MASK,MASK,MASK,MASK,MASK,MASK,MASK}; }

    v16hi foow() {
      return (v16hi){MASKS,MASKS,MASKS,MASKS,MASKS,MASKS,MASKS,MASKS,
                     MASKS,MASKS,MASKS,MASKS,MASKS,MASKS,MASKS,MASKS};
    }

    v32qi foob() {
      return (v32qi){1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
                     1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1};
    }

On x86_64 with -mavx, we currently produce very different implementations:

    fooq:
            movabs  rax, 72340172838076673
            push    rbp
            mov     rbp, rsp
            and     rsp, -32
            mov     QWORD PTR [rsp-8], rax
            vbroadcastsd    ymm0, QWORD PTR [rsp-8]
            leave
            ret

    food:
            vbroadcastss    ymm0, DWORD PTR .LC2[rip]
            ret

    foow:
            vmovdqa ymm0, YMMWORD PTR .LC3[rip]
            ret

    foob:
            vmovdqa ymm0, YMMWORD PTR .LC4[rip]
            ret

clang currently produces the vbroadcastss for all four.

I discovered that some of my "day job" code used the "fooq" idiom, requiring a stack frame, and both reads and writes to memory [of a compile-time constant]. I suspect the fix is to add a define_insn_and_split or two to i386/sse.md, and perhaps something can be done in expand, but I'm confused why LRA/reload spills the DImode component of V4DI to the stack frame, but places the SImode component of V8SI in the constant pool.
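As a sanity check on the premise that the four initializers above describe the same 256-bit constant, here is a minimal sketch that compares their byte representations. The helper name broadcasts_are_bytewise_equal is hypothetical and not part of the original testcase; the typedefs and macros are copied from the report (with the __attribute__ spelling). It deliberately keeps the vectors in local variables rather than returning them by value, so it should also build without -mavx.

```c
#include <assert.h>
#include <string.h>

typedef unsigned long long v4di  __attribute__((vector_size(32)));
typedef unsigned int       v8si  __attribute__((vector_size(32)));
typedef unsigned short     v16hi __attribute__((vector_size(32)));
typedef unsigned char      v32qi __attribute__((vector_size(32)));

#define MASK  0x01010101
#define MASKL 0x0101010101010101ULL
#define MASKS 0x0101

/* Hypothetical helper (not in the bug report): returns 1 if all four
   constants have identical 32-byte object representations, else 0. */
int broadcasts_are_bytewise_equal(void)
{
  v4di  q = {MASKL, MASKL, MASKL, MASKL};
  v8si  d = {MASK, MASK, MASK, MASK, MASK, MASK, MASK, MASK};
  v16hi w = {MASKS, MASKS, MASKS, MASKS, MASKS, MASKS, MASKS, MASKS,
             MASKS, MASKS, MASKS, MASKS, MASKS, MASKS, MASKS, MASKS};
  v32qi b = {1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
             1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1};

  /* All four types are 32 bytes wide, so memcmp over sizeof q is safe. */
  assert(sizeof q == sizeof d && sizeof d == sizeof w && sizeof w == sizeof b);

  /* Every byte is 0x01 in each case, independent of endianness. */
  return memcmp(&q, &d, sizeof q) == 0
      && memcmp(&d, &w, sizeof d) == 0
      && memcmp(&w, &b, sizeof w) == 0;
}
```

Calling broadcasts_are_bytewise_equal() from main and checking that it returns 1 confirms the constants are interchangeable, which is why a single broadcast instruction would suffice for all four.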
This is related (distantly) to PRs 100865 and 106060, but is potentially target independent, and seems to be going wrong in LRA/reload's REG_EQUIV elimination. Thoughts? Apologies if this is a dup. I'm happy to work up a patch if someone could advise on where best this should be fixed. Perhaps RTL's vec_duplicate could be canonicalized to the most appropriate vector mode?
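To make the canonicalization idea a little more concrete, here is a rough sketch, in plain C rather than RTL, of how one might find the narrowest element width for which a constant vector is a pure broadcast. This is only an illustration under my own assumptions: narrowest_broadcast_width is a hypothetical name, it is not how GCC represents or canonicalizes vec_duplicate, and which width is "most appropriate" for a given target is exactly the open question.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: given the byte image of a constant vector,
   return the smallest element width in bytes (1, 2, 4 or 8) such that
   the vector is that element repeated, or 0 if there is none. */
size_t narrowest_broadcast_width(const unsigned char *bytes, size_t len)
{
  static const size_t widths[] = {1, 2, 4, 8};

  for (size_t i = 0; i < sizeof widths / sizeof widths[0]; i++) {
    size_t w = widths[i];

    /* The width must divide the vector and be strictly smaller than it. */
    if (w >= len || len % w != 0)
      continue;

    /* bytes[0..len-w) equals bytes[w..len) exactly when the data
       repeats with period w. */
    if (memcmp(bytes, bytes + w, len - w) == 0)
      return w;
  }
  return 0;
}
```

For the constants in this report every byte is 0x01, so the helper would report a width of 1 for all four; a target could then widen that to whichever element size has the cheapest broadcast (on AVX without AVX2, presumably the 4-byte vbroadcastss that clang picks).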