We're going to commit to autovect-branch vectorization support for
non-unit-stride accesses.
We'd like to suggest a few new tree-codes/optabs in order to express the
extraction and merging of elements from/to vectors.

Here are the suggested tree-codes/optabs; an example on how they are going
to be used follows.

(1) extract elements at even/odd fields from two vectors:
optabs: vec_extract_even_optab, vec_extract_odd_optab
tree-codes: VEC_EXTRACT_EVEN_EXPR, VEC_EXTRACT_ODD_EXPR

to be used as follows:
      vec3 = vec_extract_even_expr (vec1, vec2)
      vec4 = vec_extract_odd_expr (vec1, vec2)
producing (for vectorization factor (VF)= 4):
      vec3 = [vec1.0, vec1.2, vec2.0, vec2.2]
      vec4 = [vec1.1, vec1.3, vec2.1, vec2.3]

(2) extract the high/low-order elements of two vectors and merge them
together:
optabs: vec_interleave_high_optab, vec_interleave_low_optab
tree-code: VEC_INTERLEAVE_HIGH_EXPR, VEC_INTERLEAVE_LOW_EXPR

to be used as follows:
      vec3 = vec_interleave_high_expr (vec1, vec2)
      vec4 = vec_interleave_low_expr (vec1, vec2)
producing (for VF=4):
      vec3 = [vec1.0, vec2.0, vec1.1, vec2.1]
      vec4 = [vec1.2, vec2.2, vec1.3, vec2.3]

Notes:
1) These 4 idioms can be used to vectorize strided accesses with power-of-2
strides: to vectorize a load with stride d we'd need d*log(d) extracts, and
similarly for stores (a store with stride d requires  d*log(d)
interleaves). Note that you can't get any better than that even if you have
more powerful mechanisms than extract_even/odd and interleave_hi/lo (you
need d-1 operations to extract elements from d vectors even if each
operation can extract an arbitrary set of elements from a pair of vectors).

2) The choice to address only power-of-2 strides as a first step was made
because these are the most common strides found in computations, and
because there's a well defined regularity in handling power-of-2 strides
(with respect to generating the extracts/interleaves), so it's much easier
to automate than non-power-of-2 strides where there's less regularity, and
usually requires a tailored solution per stride.

3) We could introduce more generic extract/interleave idioms, that also
take a permutation mask as a third argument and extract/interleave elements
according to the mask (e.g. permute). However, such generic permutes are
not generally supported by platforms. Also, as noted above, they do not
provide a more efficient solution than the extract_odd/even and
interleave_high/low for the power-of-2-stride case. In the future, when we
extend this work to handle non-power-of-2 strides, we'll need to introduce
more powerful mechanisms (ala permute).

4) While a general permute like operation is generally not supported
efficiently on most platforms, the above 4 idioms can be supported
efficiently even on platforms that have only a specialized version of
permute. SSE can use a combination of pshuf* and unpack* instructions (and
in some cases just a shufps); Altivec can use the vperm (and in some cases
a vmerge*).
We'll appreciate feedback on how these idioms could be supported on other
platforms.

more details below,


dorit and Ira


Background:
      The new functionality is going to allow us to vectorize computations
with strides that are a power-of-2, like in the example below, in which the
real and imaginary parts are interleaved, and therefore each of the
data-refs accesses data with stride 2:

  for (i = 0; i < n; i++) {
     tmp_re = in[2*i] * coefs[2*i] - in[2*i+1] * coefs[2*i+1];
     tmp_im = in[2*i] * coefs[2*i+1] + in[2*i+1] * coefs[2*i];
     out[2*i] = tmp_re;
     out[2*i+1] = temp_im;
  }

What is generally going to happen is that, for a VF=4, we're going to:

(1) load this data from memory:
      vec_in1 = [re0,im0,re1,im1] = vload &in
      vec_in2 = [re2,im2,re3,im3] = vload &in[VF]
      (and similarly for the coefs array)

and then, because we're doing different operations on the odd and even
elements, we need to
(2) arrange them into separate vectors:
      vec_in_re = [re0,re1,re2,re3] = extract_even (vec_in1, vec_in2)
      vec_in_im = [im0,im1,im2,im3] = extract_odd (vec_in1, vec_in2)
      (and similarly for the coefs array)

(3) now we can operate on the different elements:
      vec_tmp_re = vec_in_re * vec_coefs_re - vec_in_im * vec_coefs_im
      vec_tmp_im = vec_in_re * vec_coefs_im + vec_in_im * vec_coefs_re

so now we have:
      vec_tmp_re = [tmp_re0,tmp_re1,tmp_re2,tmp_re3]
      vec_tmp_im = [tmp_im0,tmp_im1,tmp_im2,tmp_im3]

Lastly, upon a store to memory, we need to
(4) interleave the results:
      vec_tmp_1 = interleave_high (vec_tmp_re, vec_tmp_im)
      vec_tmp_2 = interleave_low (vec_tmp_re, vec_tmp_im)

so now we have:
      vec_tmp_1 = [tmp_re0,tmp_im0,tmp_re1,tmp_im1]
      vec_tmp_2 = [tmp_re2,tmp_im2,tmp_re3,tmp_im3]

ready to be
(5) stored into array out:
      vstore (vec_tmp_1, &out)
      vstore (vec_tmp_1, &out[VF])

Reply via email to