On Tue, Mar 22, 2011 at 5:52 PM, Richard Sandiford
<[email protected]> wrote:
> This is an RFC about adding gimple and optab support for things like
> ARM's load-lane and store-lane instructions. It builds on an earlier
> discussion between Ira and Julian, with the aim of allowing these
> instructions to be used by the vectoriser.
>
> These instructions operate on N vector registers of M elements each and
> on a sequence of 1 or M N-element structures. They come in three forms:
>
> - full load/store:
>
> 0<=I<N, 0<=J<M, register[I][J] = memory[J*N+I]
>
> E.g., for N=3, M=4:
>
> Registers               Memory
> ----------------        ---------------
> RRRR GGGG BBBB   <--->  RGB RGB RGB RGB
>
> - lane load/store:
>
> given L, 0<=I<N, register[I][L] = memory[I]
>
> E.g., for N=3, M=4, L=2:
>
> Registers               Memory
> ----------------        ---------------
> ..R. ..G. ..B.   <--->  RGB
>
> - load-and-duplicate:
>
> 0<=I<N, 0<=J<M, register[I][J] = memory[I]
>
> E.g., for N=3, M=4 (three V4HIs):
>
> Registers               Memory
> ----------------        ----------------
> RRRR GGGG BBBB   <----  RGB
>
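> In C-like pseudocode, the loads are (a reference model only; reg, mem,
> N, M and the lane number L are as in the descriptions above, and the
> stores are the exact mirror image):
>
>   /* Full load: register[I][J] = memory[J*N+I].  */
>   for (j = 0; j < M; j++)
>     for (i = 0; i < N; i++)
>       reg[i][j] = mem[j * N + i];
>
>   /* Lane load for a given L: register[I][L] = memory[I].  */
>   for (i = 0; i < N; i++)
>     reg[i][L] = mem[i];
>
>   /* Load-and-duplicate: register[I][J] = memory[I].  */
>   for (i = 0; i < N; i++)
>     for (j = 0; j < M; j++)
>       reg[i][j] = mem[i];
>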
> Starting points:
>
> 1) Memory references should be MEM_REFs at the gimple level.
> We shouldn't add new tree codes for memory references.
>
> 2) Because of the amount of data involved (at least in the "full" case),
> the gimple statement that represents the lane interleaving should
> also have the MEM_REF. The two shouldn't be split between
> statements.
>
> 3) The ARM doubleword instructions allow the N vectors to be in
> consecutive registers (DM, DM+1, ...) or in every second register
> (DM, DM+2, ...). However, the latter case is only interesting
> if we're dealing with halves of quadword vectors. It's therefore
> reasonable to view the N vectors as one big value.
>
> (3) significantly simplifies things at the rtl level for ARM, because it
> avoids having to find some way of saying that N separate pseudos must
> be allocated to N consecutive hard registers. If other targets allow the
> N vectors to be stored in arbitrary (non-consecutive) registers, then
> they could split the register up into subregs at expand time.
> The lower-subreg pass should then optimise things nicely.
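>
> For example (a sketch only; assuming the N=2 case of V4SI vectors maps
> to a single OImode value):
>
>   rtx whole = expand_normal (combined);
>   rtx vec0 = simplify_gen_subreg (V4SImode, whole, OImode, 0);
>   rtx vec1 = simplify_gen_subreg (V4SImode, whole, OImode, 16);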
>
> The easiest way of dealing with (1) and (2) seems to be to model the
> operations as built-in functions. And if we do treat the N vectors as
> a single value, the load functions can simply return that value. So we
> could have something like:
>
> - full load/store:
>
> combined_vectors = __builtin_load_lanes (memory);
> memory = __builtin_store_lanes (combined_vectors);
>
> - lane load/store:
>
> combined_vectors = __builtin_load_lane (memory, combined_vectors, lane);
> memory = __builtin_store_lane (combined_vectors, lane);
>
> - load-and-duplicate:
>
> combined_vectors = __builtin_load_dup (memory);
>
> We could then use normal component references to set or get the individual
> vectors of combined_vectors. Does that sound OK so far?
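>
> For the RGB example (N=3, M=4), that would give gimple along the lines
> of (names purely illustrative):
>
>   combined = __builtin_load_lanes (*ptr);
>   vr = <component 0 of combined>;
>   vg = <component 1 of combined>;
>   vb = <component 2 of combined>;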
>
> The question then is: what type should combined_vectors have? (At this
> point I'm just talking about types, not modes.) The main possibilities
> seemed to be:
>
> 1. an integer type
>
> Pros
> * Gimple registers can store integers.
>
> Cons
> * As Julian points out, GCC doesn't really support integer types
> that are wider than 2 HOST_WIDE_INTs. It would be good to
> remove that restriction, but it might be a lot of work.
>
> * We're not really using the type as an integer.
>
> * The combination of the integer type and the __builtin_load_lanes
> array argument wouldn't be enough to determine the correct
> load operation. __builtin_load_lanes would need something
> like a vector count argument (N in the above description) as well.
>
> 2. a vector type
>
> Pros
> * Gimple registers can store vectors.
>
> Cons
> * For vld3, this would mean creating vector types with a non-power-
> of-two number of elements. GCC doesn't support those yet, and you get
> ICEs as soon as you try to use them. (Remember that this is
> all about types, not modes.)
>
> It _might_ be interesting to implement this support, but as
> above, it would be a lot of work. It also raises some tricky
> semantic questions, such as: what is the alignment of the new
> vectors? Which leads to...
>
> * The alignment of the type would be strange. E.g. suppose
> we're dealing with M=2, and use uint32xY_t to represent a
> vector of Y uint32_ts. The types and alignments would be:
>
> N=2 uint32x4_t, alignment 16
> N=3 uint32x6_t, alignment 8 (if we follow the convention for modes)
> N=4 uint32x8_t, alignment 32
>
> We don't need alignments greater than 8 in our intended use;
> 16 and 32 are overkill.
>
> * We're not really using the type as a single vector,
> but as a collection of vectors.
>
> * The combination of the vector type and the __builtin_load_lanes
> array argument wouldn't be enough to determine the correct
> load operation. __builtin_load_lanes would need something
> like a vector count argument (N in the above description) as well.
>
> 3. an array-of-vectors type
>
> Pros
> * No support for new GCC features (large integers or non-power-of-two
> vectors) is needed.
>
> * The alignment of the type would be taken from the alignment of the
> individual vectors, which is correct.
>
> * It accurately reflects how the loaded value is going to be used.
>
> * The type uniquely identifies the correct load operation,
> without need for additional arguments. (This is minor.)
>
> Cons
> * Gimple registers can't store array values.
Simple. Just make them registers anyway (I did that in the past
when working on middle-end arrays). You'd set DECL_GIMPLE_REG_P
on the decl.
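
For the record, roughly like this (a sketch; the variable name and the
array type are illustrative):

  tree tmp = create_tmp_var (atype, "vlanes");  /* atype: array of vectors */
  DECL_GIMPLE_REG_P (tmp) = 1;
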
> 4. a vector-of-vectors type
>
> Cons
> * I don't think we want that ;)

Using an array type sounds like the only sensible option to me, apart
from using a large non-power-of-two vector type (but then you'd have
the issue of what the operations would operate on; see below).
> So I think the only disadvantage of using an array of vectors is that the
> result can never be a gimple register. But that isn't much of a disadvantage
> really; the things we care about are the individual vectors, which can
> of course be treated as gimple registers. I think our tracking of memory
> values is good enough for combined_vectors to be treated as such.
>
> These arrays of vectors would still need to have a non-BLK mode,
> so that they can be stored in _rtl_ registers. But we need that anyway
> for ARM's arm_neon.h; the code that today's GCC produces for the intrinsic
> functions is very poor.
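>
> (For comparison, arm_neon.h already exposes exactly this shape via
> structure types:
>
>   uint32x4x3_t v = vld3q_u32 (ptr);   /* v.val[0..2] are uint32x4_t */
>   vst3q_u32 (ptr, v);
>
> so the array-of-vectors value corresponds directly to the .val array.)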
>
> So how about the following functions? (Forgive the pascally syntax.)
>
> __builtin_load_lanes (REF : array N*M of X)
> returns array N of vector M of X
> maps to vldN on ARM
> in practice, the result would be used in assignments of the form:
> vectorY = ARRAY_REF <result, Y>
>
> __builtin_store_lanes (VECTORS : array N of vector M of X)
> returns array N*M of X
> maps to vstN on ARM
> in practice, the argument would be populated by assignments of the form:
> ARRAY_REF <VECTORS, Y> = vectorY
>
> __builtin_load_lane (REF : array N of X,
> VECTORS : array N of vector M of X,
> LANE : integer)
> returns array N of vector M of X
> maps to vldN_lane on ARM
>
> __builtin_store_lane (VECTORS : array N of vector M of X,
> LANE : integer)
> returns array N of X
> maps to vstN_lane on ARM
>
> __builtin_load_dup (REF : array N of X)
> returns array N of vector M of X
> maps to vldN_dup on ARM
>
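> To make the store direction concrete too (names illustrative), an RGB
> interleave for N=3 would populate the argument vector by vector and
> then assign the call result to the memory destination:
>
>   ARRAY_REF <combined, 0> = vr;
>   ARRAY_REF <combined, 1> = vg;
>   ARRAY_REF <combined, 2> = vb;
>   MEM_REF <out> = __builtin_store_lanes (combined);
>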
> I've hacked up a prototype of this and it seems to produce good code.
> What do you think?
How do you expect these to be used?  That is, would you ever expect
components of those large vectors/arrays to be used in operations
like add, or does the HW provide vector-lane variants for those as well?
Thus, will

  for (i=0; i<N; ++i)
    X[i] = Y[i] + Z[i];

result in a single add per vector-lane load, or in a single vector-lane
load feeding M "unrolled" instances of (small) vector adds?  If the latter,
then we have to think about indexing the vector lanes as well as allowing
partial stores (or having a vector-lane construct operation).  Representing
vector lanes as automatic memory (with an array-of-vectors type) makes
things easy, but possibly not very efficient.
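
For concreteness: with the array-of-vectors representation I'd expect
the "unrolled" variant to come out roughly as (illustrative):

  a = __builtin_load_lanes (*Y);
  b = __builtin_load_lanes (*Z);
  ARRAY_REF <c, 0> = ARRAY_REF <a, 0> + ARRAY_REF <b, 0>;
  ... one small vector add per component ...
  *X = __builtin_store_lanes (c);
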
I had new tree/stmt codes for array loads/stores for middle-end arrays.
Eventually the vector lane support can at least walk in the same
direction as middle-end arrays would ;)
Richard.
> Richard
>