https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90424

--- Comment #3 from Richard Biener <rguenth at gcc dot gnu.org> ---
OK, so the "easier" way to allow aligned sub-vector inserts produces for

typedef unsigned char v16qi __attribute__((vector_size(16)));
v16qi load (const void *p)
{
  v16qi r;
  __builtin_memcpy (&r, p, 8);
  return r;
}
the following GIMPLE:

load (const void * p)
{
  v16qi r;
  long unsigned int _3;
  v16qi _5;
  vector(8) unsigned char _7;

  <bb 2> :
  _3 = MEM[(char * {ref-all})p_2(D)];
  _7 = VIEW_CONVERT_EXPR<vector(8) unsigned char>(_3);
  r_9 = BIT_INSERT_EXPR <r_8(D), _7, 0 (64 bits)>;
  _5 = r_9;
  return _5;

and unfortunately (as I feared) the result round-trips through the stack:

load:
.LFB0:
        .cfi_startproc
        movq    (%rdi), %rax
        pxor    %xmm1, %xmm1
        movaps  %xmm1, -24(%rsp)
        movq    %rax, -24(%rsp)
        movdqa  -24(%rsp), %xmm0
        ret
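For comparison, the load could in principle be a single movq-class instruction that zero-fills the upper half (legal here since the upper bytes of r are undefined). A hand-written SSE2 intrinsics sketch of that form, not something the patch generates:

```c
#include <emmintrin.h>

typedef unsigned char v16qi __attribute__((vector_size(16)));

/* Hypothetical ideal form: one 64-bit vector load (movq into %xmm)
   that zeroes the upper 64 bits, with no stack round-trip.  */
v16qi load_movq (const void *p)
{
  return (v16qi) _mm_loadl_epi64 ((const __m128i *) p);
}
```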

via expanding to

(insn 8 7 9 2 (set (subreg:V8QI (reg:V16QI 89 [ r ]) 0)
        (subreg:V8QI (reg:DI 88) 0)) "t.c":5:3 -1
     (nil))

and register-allocated (RAed) from

(insn 8 7 13 2 (set (subreg:V8QI (reg:V16QI 89 [ r ]) 0)
        (mem:V8QI (reg:DI 90) [0 MEM[(char * {ref-all})p_2(D)]+0 S8 A8]))
"t.c":5:3 1088 {*movv8qi_internal}
     (expr_list:REG_DEAD (reg:DI 90)
        (nil)))


It's still IMHO the most reasonable IL given the vector constructors
we allow.

Inserting 4 bytes is even worse, though; inserting the upper 8 bytes
behaves like the above.
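Hypothetical testcase variants for those two cases (my naming, not from the report; both follow the pattern of the 8-byte testcase above):

```c
typedef unsigned char v16qi __attribute__((vector_size(16)));

/* 4-byte insert into the low part -- the "even worse" case.  */
v16qi load4 (const void *p)
{
  v16qi r;
  __builtin_memcpy (&r, p, 4);
  return r;
}

/* 8-byte insert into the upper half -- behaves like the 8-byte case.  */
v16qi load_hi (const void *p)
{
  v16qi r;
  __builtin_memcpy ((char *) &r + 8, p, 8);
  return r;
}
```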

Code generation isn't worse than with an unpatched compiler, and the
GIMPLE is clearly better (allowing for follow-up optimizations).
