[Bug target/90424] memcpy into vector builtin not optimized

rguenth at gcc dot gnu.org Tue, 14 May 2019 04:35:44 -0700

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90424


--- Comment #5 from Richard Biener <rguenth at gcc dot gnu.org> ---
(In reply to Matthias Kretz from comment #2)
> I can't read the SSA code with certainty, but bit-inserting sounds like what
> I want to have. Alternatively, the partial vector load could be implemented
> like this - and looks even worse (https://godbolt.org/z/nJuTn-):
> template <class T>
> using V [[gnu::vector_size(16)]] = T;
> 
> template <class T, unsigned... I>
> V<T> load(const void *p) {
>   const T* q = static_cast<const T*>(p);
>   V<T> r = {q[I]...};
>   return r;
> }
> 
> // movq or movsd
> template V<char  > load<char  , 0,1,2,3,4,5,6,7>(const void *);
> template V<short > load<short , 0,1,2,3>(const void *);
> template V<int   > load<int   , 0,1>(const void *);
> template V<long  > load<long  , 0>(const void *);
> template V<float > load<float , 0,1>(const void *);
> template V<double> load<double, 0>(const void *);
> 
> // movd or movss
> template V<char > load<char , 0,1,2,3>(const void *);
> template V<short> load<short, 0,1>(const void *);
> template V<int  > load<int  , 0>(const void *);
> template V<float> load<float, 0>(const void *);

Those end up like

load<int, 0, 1> (const void * p)
{
  V r;
  int _1;
  int _2;

  <bb 2> [local count: 1073741824]:
  _1 = MEM[(const int *)p_3(D)];
  _2 = MEM[(const int *)p_3(D) + 4B];
  r_5 = {_1, _2};
  return r_5;

It's not immediately clear where to optimize this: the two scalar loads would
need to be merged into one wider load, and the constructor adjusted to one
built from vectors.  The bswap pass looks like a good candidate for this.
Split out to PR90460.
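
To make the equivalence concrete, here is a minimal source-level sketch of the
two forms for the load<int, 0, 1> case. It only illustrates what a merged load
would have to compute; it is not the transformation GCC performs. The names
v4si, load_elementwise and load_merged are hypothetical, and the sketch
assumes a 4-byte int.

```cpp
#include <cstring>

// 16-byte vector of four ints (GNU vector extension, as in the quoted test case).
typedef int v4si __attribute__((vector_size(16)));

// Element-wise form: two scalar loads feeding a vector constructor.
// Lanes 2 and 3 are zero-initialized by the braced initializer.
v4si load_elementwise(const void *p) {
  const int *q = static_cast<const int *>(p);
  v4si r = {q[0], q[1]};
  return r;
}

// Merged form: a single 8-byte copy into the low half of a zeroed vector.
// Assumes sizeof(int) == 4, so 8 bytes cover exactly lanes 0 and 1.
v4si load_merged(const void *p) {
  v4si r = {};
  std::memcpy(&r, p, 8);
  return r;
}
```

Both functions should return the same vector for any readable 8-byte input, so
ideally both would compile to a single movq on x86-64.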
