We're going to commit vectorization support for non-unit-stride accesses
to the autovect-branch.
We'd like to suggest a few new tree-codes/optabs in order to express the
extraction and merging of elements from/to vectors.
Here are the suggested tree-codes/optabs; an example on how they are going
to be used follows.
(1) extract elements at even/odd fields from two vectors:
optabs: vec_extract_even_optab, vec_extract_odd_optab
tree-codes: VEC_EXTRACT_EVEN_EXPR, VEC_EXTRACT_ODD_EXPR
to be used as follows:
vec3 = vec_extract_even_expr (vec1, vec2)
vec4 = vec_extract_odd_expr (vec1, vec2)
producing (for vectorization factor (VF) = 4):
vec3 = [vec1.0, vec1.2, vec2.0, vec2.2]
vec4 = [vec1.1, vec1.3, vec2.1, vec2.3]
(2) extract the high/low-order elements of two vectors and merge them
together:
optabs: vec_interleave_high_optab, vec_interleave_low_optab
tree-codes: VEC_INTERLEAVE_HIGH_EXPR, VEC_INTERLEAVE_LOW_EXPR
to be used as follows:
vec3 = vec_interleave_high_expr (vec1, vec2)
vec4 = vec_interleave_low_expr (vec1, vec2)
producing (for VF=4):
vec3 = [vec1.0, vec2.0, vec1.1, vec2.1]
vec4 = [vec1.2, vec2.2, vec1.3, vec2.3]
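To make the semantics concrete, here is a small scalar-C model of the four
idioms for VF=4 (the vec4 type and the function names below are ours, purely
for illustration; they are not part of the proposal):

  /* Scalar model of the proposed idioms for VF=4 (illustration only).  */
  typedef struct { float e[4]; } vec4;

  /* vec_extract_even_expr: [a.0, a.2, b.0, b.2]  */
  vec4 extract_even (vec4 a, vec4 b)
  { vec4 r = {{ a.e[0], a.e[2], b.e[0], b.e[2] }}; return r; }

  /* vec_extract_odd_expr: [a.1, a.3, b.1, b.3]  */
  vec4 extract_odd (vec4 a, vec4 b)
  { vec4 r = {{ a.e[1], a.e[3], b.e[1], b.e[3] }}; return r; }

  /* vec_interleave_high_expr: [a.0, b.0, a.1, b.1]  */
  vec4 interleave_high (vec4 a, vec4 b)
  { vec4 r = {{ a.e[0], b.e[0], a.e[1], b.e[1] }}; return r; }

  /* vec_interleave_low_expr: [a.2, b.2, a.3, b.3]  */
  vec4 interleave_low (vec4 a, vec4 b)
  { vec4 r = {{ a.e[2], b.e[2], a.e[3], b.e[3] }}; return r; }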
Notes:
1) These 4 idioms can be used to vectorize strided accesses with power-of-2
strides: to vectorize a load with stride d we need d*log2(d) extracts, and
similarly for stores (a store with stride d requires d*log2(d) interleaves);
a worked stride-4 example appears after these notes. Note that you can't do
any better than that even with mechanisms more powerful than
extract_even/odd and interleave_high/low (you still need d-1 operations to
gather the elements of a single output vector from d input vectors, even if
each operation can extract an arbitrary set of elements from a pair of
vectors).
2) The choice to address only power-of-2 strides as a first step was made
because these are the most common strides found in computations, and
because there's a well-defined regularity in handling power-of-2 strides
(with respect to generating the extracts/interleaves), so it's much easier
to automate than non-power-of-2 strides, where there's less regularity and a
tailored solution is usually needed per stride.
3) We could introduce more generic extract/interleave idioms that also
take a permutation mask as a third argument and extract/interleave elements
according to the mask (i.e. a permute). However, such generic permutes are
not generally supported by platforms. Also, as noted above, they do not
provide a more efficient solution than extract_even/odd and
interleave_high/low for the power-of-2-stride case. In the future, when we
extend this work to handle non-power-of-2 strides, we'll need to introduce
more powerful mechanisms (a la permute).
4) While a general permute-like operation is usually not supported
efficiently on most platforms, the above 4 idioms can be supported
efficiently even on platforms that have only a specialized version of
permute. SSE can use a combination of pshuf* and unpack* instructions (and
in some cases just a shufps); Altivec can use vperm (and in some cases
a vmerge*). A sketch of one possible SSE mapping appears after these notes.
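As an illustration of the d*log2(d) count in note 1, here is how a stride-4
load (d=4, VF=4) can be handled with two rounds of extracts, 8 in total;
a, b, c, d denote the four interleaved streams in memory:

  loads:
  v0 = [a0,b0,c0,d0]   v1 = [a1,b1,c1,d1]
  v2 = [a2,b2,c2,d2]   v3 = [a3,b3,c3,d3]
  round 1:
  t0 = extract_even (v0, v1) = [a0,c0,a1,c1]
  t1 = extract_even (v2, v3) = [a2,c2,a3,c3]
  t2 = extract_odd  (v0, v1) = [b0,d0,b1,d1]
  t3 = extract_odd  (v2, v3) = [b2,d2,b3,d3]
  round 2:
  va = extract_even (t0, t1) = [a0,a1,a2,a3]
  vc = extract_odd  (t0, t1) = [c0,c1,c2,c3]
  vb = extract_even (t2, t3) = [b0,b1,b2,b3]
  vd = extract_odd  (t2, t3) = [d0,d1,d2,d3]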
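And to substantiate the SSE claim in note 4, here is one possible mapping of
the four idioms to SSE intrinsics for vectors of four floats. This is only a
sketch, assuming element .0 is the element at the lowest address (so what we
call interleave_high corresponds to SSE's unpacklo); the helper names are
ours:

  #include <xmmintrin.h>

  /* vec_extract_even: [a.0, a.2, b.0, b.2]  */
  static inline __m128 extract_even_sse (__m128 a, __m128 b)
  { return _mm_shuffle_ps (a, b, _MM_SHUFFLE (2, 0, 2, 0)); }

  /* vec_extract_odd: [a.1, a.3, b.1, b.3]  */
  static inline __m128 extract_odd_sse (__m128 a, __m128 b)
  { return _mm_shuffle_ps (a, b, _MM_SHUFFLE (3, 1, 3, 1)); }

  /* vec_interleave_high: [a.0, b.0, a.1, b.1]  */
  static inline __m128 interleave_high_sse (__m128 a, __m128 b)
  { return _mm_unpacklo_ps (a, b); }

  /* vec_interleave_low: [a.2, b.2, a.3, b.3]  */
  static inline __m128 interleave_low_sse (__m128 a, __m128 b)
  { return _mm_unpackhi_ps (a, b); }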
We'd appreciate feedback on how these idioms could be supported on other
platforms.
more details below,
dorit and Ira
Background:
The new functionality is going to allow us to vectorize computations
with strides that are a power of 2, as in the example below, in which the
real and imaginary parts are interleaved, so each of the data-refs accesses
data with stride 2:
for (i = 0; i < n; i++) {
    tmp_re = in[2*i] * coefs[2*i] - in[2*i+1] * coefs[2*i+1];
    tmp_im = in[2*i] * coefs[2*i+1] + in[2*i+1] * coefs[2*i];
    out[2*i] = tmp_re;
    out[2*i+1] = tmp_im;
}
What is generally going to happen is that, for a VF=4, we're going to:
(1) load this data from memory:
vec_in1 = [re0,im0,re1,im1] = vload &in
vec_in2 = [re2,im2,re3,im3] = vload &in[VF]
(and similarly for the coefs array)
and then, because we're doing different operations on the odd and even
elements, we need to
(2) arrange them into separate vectors:
vec_in_re = [re0,re1,re2,re3] = extract_even (vec_in1, vec_in2)
vec_in_im = [im0,im1,im2,im3] = extract_odd (vec_in1, vec_in2)
(and similarly for the coefs array)
(3) now we can operate on the different elements:
vec_tmp_re = vec_in_re * vec_coefs_re - vec_in_im * vec_coefs_im
vec_tmp_im = vec_in_re * vec_coefs_im + vec_in_im * vec_coefs_re
so now we have:
vec_tmp_re = [tmp_re0,tmp_re1,tmp_re2,tmp_re3]
vec_tmp_im = [tmp_im0,tmp_im1,tmp_im2,tmp_im3]
Lastly, upon a store to memory, we need to
(4) interleave the results:
vec_tmp_1 = interleave_high (vec_tmp_re, vec_tmp_im)
vec_tmp_2 = interleave_low (vec_tmp_re, vec_tmp_im)
so now we have:
vec_tmp_1 = [tmp_re0,tmp_im0,tmp_re1,tmp_im1]
vec_tmp_2 = [tmp_re2,tmp_im2,tmp_re3,tmp_im3]
ready to be
(5) stored into array out:
vstore (vec_tmp_1, &out)
vstore (vec_tmp_2, &out[VF])
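For concreteness, here is how steps (1)-(5) could look for one VF=4
iteration when lowered to SSE, reusing the illustrative extract/interleave
helpers sketched after the notes above. The function name is ours, unaligned
loads/stores are used only to keep the example self-contained, and this is
just one possible lowering, not the code the vectorizer is required to emit:

  /* One VF=4 iteration of the complex-multiply loop: processes 4 (re,im)
     pairs starting at in[2*i], coefs[2*i], out[2*i].  */
  void complex_mul_step (const float *in, const float *coefs, float *out)
  {
    /* (1) load two vectors from each interleaved array.  */
    __m128 in1 = _mm_loadu_ps (in);          /* [re0,im0,re1,im1] */
    __m128 in2 = _mm_loadu_ps (in + 4);      /* [re2,im2,re3,im3] */
    __m128 co1 = _mm_loadu_ps (coefs);
    __m128 co2 = _mm_loadu_ps (coefs + 4);

    /* (2) separate the real and imaginary parts.  */
    __m128 in_re = extract_even_sse (in1, in2);   /* [re0,re1,re2,re3] */
    __m128 in_im = extract_odd_sse  (in1, in2);   /* [im0,im1,im2,im3] */
    __m128 co_re = extract_even_sse (co1, co2);
    __m128 co_im = extract_odd_sse  (co1, co2);

    /* (3) operate on the separated elements.  */
    __m128 tmp_re = _mm_sub_ps (_mm_mul_ps (in_re, co_re),
                                _mm_mul_ps (in_im, co_im));
    __m128 tmp_im = _mm_add_ps (_mm_mul_ps (in_re, co_im),
                                _mm_mul_ps (in_im, co_re));

    /* (4) interleave the results back into memory order.  */
    __m128 out1 = interleave_high_sse (tmp_re, tmp_im); /* [re0,im0,re1,im1] */
    __m128 out2 = interleave_low_sse  (tmp_re, tmp_im); /* [re2,im2,re3,im3] */

    /* (5) store.  */
    _mm_storeu_ps (out, out1);
    _mm_storeu_ps (out + 4, out2);
  }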