https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90424
--- Comment #3 from Richard Biener <rguenth at gcc dot gnu.org> ---
OK, so the "easier" way to allow aligned sub-vector inserts produces for

typedef unsigned char v16qi __attribute__((vector_size(16)));

v16qi load (const void *p)
{
  v16qi r;
  __builtin_memcpy (&r, p, 8);
  return r;
}

load (const void * p)
{
  v16qi r;
  long unsigned int _3;
  v16qi _5;
  vector(8) unsigned char _7;

  <bb 2> :
  _3 = MEM[(char * {ref-all})p_2(D)];
  _7 = VIEW_CONVERT_EXPR<vector(8) unsigned char>(_3);
  r_9 = BIT_INSERT_EXPR <r_8(D), _7, 0 (64 bits)>;
  _5 = r_9;
  return _5;
}

and unfortunately (as I feared)

load:
.LFB0:
        .cfi_startproc
        movq    (%rdi), %rax
        pxor    %xmm1, %xmm1
        movaps  %xmm1, -24(%rsp)
        movq    %rax, -24(%rsp)
        movdqa  -24(%rsp), %xmm0
        ret

via expanding to

(insn 8 7 9 2 (set (subreg:V8QI (reg:V16QI 89 [ r ]) 0)
        (subreg:V8QI (reg:DI 88) 0)) "t.c":5:3 -1
     (nil))

RAed from

(insn 8 7 13 2 (set (subreg:V8QI (reg:V16QI 89 [ r ]) 0)
        (mem:V8QI (reg:DI 90) [0 MEM[(char * {ref-all})p_2(D)]+0 S8 A8])) "t.c":5:3 1088 {*movv8qi_internal}
     (expr_list:REG_DEAD (reg:DI 90)
        (nil)))

It's still IMHO the most reasonable IL given the vector constructors we allow.
Inserting 4 bytes is even worse though.  Inserting upper 8 bytes is like the
above.

Code generation isn't worse than unpatched and the GIMPLE is clearly better
(allowing for followup optimizations).
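
For reference, a minimal C sketch of the two variants mentioned above.  The
function names load4 and load_hi8 and the chosen offsets are assumptions about
what "inserting 4 bytes" and "inserting upper 8 bytes" refer to; they are not
testcases taken from the PR.

typedef unsigned char v16qi __attribute__((vector_size(16)));

/* Assumed variant: insert only 4 bytes at offset 0 ("inserting 4 bytes").  */
v16qi load4 (const void *p)
{
  v16qi r;
  __builtin_memcpy (&r, p, 4);
  return r;
}

/* Assumed variant: insert 8 bytes into the upper half of the vector
   ("inserting upper 8 bytes").  */
v16qi load_hi8 (const void *p)
{
  v16qi r;
  __builtin_memcpy ((unsigned char *)&r + 8, p, 8);
  return r;
}

If the patch treats these the same way as the original testcase, the upper-half
insert would presumably become a BIT_INSERT_EXPR at bit position 64, and the
4-byte one a 32-bit insert at position 0.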