[Bug rtl-optimization/50728] Inefficient vector loads from aggregates passed by value

rguenth at gcc dot gnu.org Sat, 15 Oct 2011 01:58:30 -0700

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=50728


--- Comment #4 from Richard Guenther <rguenth at gcc dot gnu.org> 2011-10-15 08:57:27 UTC ---
(In reply to comment #3)
> The problem is that the ABI was designed with the scalar operations
> in mind, rather than possible vectorization.  If you consider an
> alternate function
> 
> A foo(A a, A b)
> {
>   a.a[0] += b.a[0];
>   a.a[1] -= b.a[1];
>   a.a[2] *= b.a[2];
>   a.a[3] /= b.a[3];
>   return a;
> }
> 
> then the way the ABI passes the floats *is* optimal.  I.e. already
> unpacked in the registers, ready for use in their scalar operations.

Actually the floats are passed in pairs, two per register, so they are
not exactly ready for scalar computation (doubles would be, but floats
are not, because of the 8-byte packing).
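
To make the pairing concrete, here is a minimal sketch in plain C. It
assumes that A is a struct holding four floats (the thread only shows
accesses such as a.a[0]) and that the target follows the x86-64 System V
calling convention, where a 16-byte aggregate of floats is split into two
8-byte chunks, each passed in the low half of its own SSE register. The
names eightbyte and split_for_passing are hypothetical and only model the
grouping; they do not inspect real registers.

```c
#include <assert.h>
#include <string.h>

/* Assumed definition of A: four packed floats. */
typedef struct { float a[4]; } A;
static_assert(sizeof(A) == 16, "A is assumed to be four packed floats");

/* Hypothetical model of one 8-byte chunk: two floats that share the
   low half of a single SSE register when A is passed by value. */
typedef struct { float lane[2]; } eightbyte;

/* Mirror the argument split: x.a[0..1] go into the first chunk,
   x.a[2..3] into the second. Neither chunk is a single scalar, and
   neither is a full 4-float vector. */
void split_for_passing(A x, eightbyte *first, eightbyte *second)
{
    memcpy(first,  &x.a[0], sizeof *first);
    memcpy(second, &x.a[2], sizeof *second);
}
```

A scalar operation on a.a[1] or a.a[3] therefore first has to extract
the upper float of its pair, while a vector operation has to join the
two pairs.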

> What you're asking for is a special private ABI for "sum", with the
> knowledge that the inputs are used, packed in their vectors.

What I am actually asking for is that the compiler optimize the stack
store/load sequence into register moves, using movhps ...
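
As a rough illustration of the register-only sequence, the sketch below
joins the two halves with SSE intrinsics instead of going through the
stack. It assumes A is a struct of four floats and that sum adds
element-wise; the function name sum_via_register_joins is hypothetical.
The intrinsic _mm_movelh_ps usually maps to movlhps, the
register-to-register relative of movhps; which instruction the compiler
would actually choose is not something this sketch settles. The incoming
half-filled registers are only simulated here with _mm_set_ps.

```c
#include <assert.h>
#if defined(__SSE__)
#include <xmmintrin.h>
#endif

/* Assumed definition of A: four packed floats. */
typedef struct { float a[4]; } A;
static_assert(sizeof(A) == 16, "A is assumed to be four packed floats");

A sum_via_register_joins(A a, A b)
{
    A r;
#if defined(__SSE__)
    /* Simulate the incoming argument registers: each holds two floats
       in its low 64 bits, the upper 64 bits are don't-care (zero here). */
    __m128 a_lo = _mm_set_ps(0.0f, 0.0f, a.a[1], a.a[0]);
    __m128 a_hi = _mm_set_ps(0.0f, 0.0f, a.a[3], a.a[2]);
    __m128 b_lo = _mm_set_ps(0.0f, 0.0f, b.a[1], b.a[0]);
    __m128 b_hi = _mm_set_ps(0.0f, 0.0f, b.a[3], b.a[2]);

    /* Join the low halves: result is {lo[0], lo[1], hi[0], hi[1]}. */
    __m128 va = _mm_movelh_ps(a_lo, a_hi);
    __m128 vb = _mm_movelh_ps(b_lo, b_hi);

    /* One packed add on the joined vectors. */
    _mm_storeu_ps(r.a, _mm_add_ps(va, vb));
#else
    /* Non-SSE targets: scalar fallback so the sketch still compiles. */
    for (int i = 0; i < 4; i++)
        r.a[i] = a.a[i] + b.a[i];
#endif
    return r;
}
```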

> Given that you can achieve the parameter register assignment that
> you want via passing the proper vector type, this seems to be a 
> simple matter of function cloning/versioning:
> 
>   V4SF sum.vector(V4SF a, V4SF b)
>   {
>     return a + b;
>   }
> 
>   user_of_sum()
>   {
>     ...
>     V4SF r.v = sum.vector(VIEW_CONVERT<V4SF, a>, VIEW_CONVERT<V4SF, b>);
>     A r = VIEW_CONVERT<A, r.v>;
>     ...
>   }
> 
> Of course, I've no idea how you're going to decide when to produce
> this particular clone.  That seems like a fairly hard decision to make,
> given the relative placements of the vectorization passes and the
> IPA passes.

Yeah, that's going to be difficult.
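
For reference, the clone from comment #3 can be written by hand in C
roughly as follows. This is only a sketch under stated assumptions: A is
taken to be a struct of four floats, v4sf uses the GCC/Clang vector_size
extension to stand in for V4SF, and memcpy plays the role of
VIEW_CONVERT. The helper names a_to_v4sf and v4sf_to_a are hypothetical.
It shows what the transformed code would look like, not how the compiler
would decide to create it.

```c
#include <assert.h>
#include <string.h>

/* Assumed definition of A: four packed floats. */
typedef struct { float a[4]; } A;

/* Four-float vector type via the GCC/Clang vector extension. */
typedef float v4sf __attribute__((vector_size(16)));
static_assert(sizeof(A) == sizeof(v4sf), "A and v4sf must have equal size");

/* The vector clone: arguments arrive as whole vectors. */
v4sf sum_vector(v4sf a, v4sf b)
{
    return a + b;
}

/* Bit-for-bit reinterpretation, standing in for VIEW_CONVERT<V4SF, a>. */
v4sf a_to_v4sf(A x)
{
    v4sf v;
    memcpy(&v, &x, sizeof v);
    return v;
}

/* Bit-for-bit reinterpretation, standing in for VIEW_CONVERT<A, r.v>. */
A v4sf_to_a(v4sf v)
{
    A x;
    memcpy(&x, &v, sizeof x);
    return x;
}

/* Caller side: convert, call the clone, convert back. */
A user_of_sum(A a, A b)
{
    v4sf rv = sum_vector(a_to_v4sf(a), a_to_v4sf(b));
    return v4sf_to_a(rv);
}
```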
