Hi,
We are trying to update our patches on complex numbers to take into
account what has been discussed.
The main change from our previous patches consists of replacing
vectors of complex types with classical vectors of real types (e.g.
V4SF instead of V2SC), associated with the existing complex opcodes
(like .COMPLEX_MUL) when vectorizing. Non-vector complex modes are
also replaced by vectors of two reals at the end of the middle-end
(e.g. SC to V2SF), so that already existing patterns can be reused.
Indeed, operations that are not complex-specific, like an addition,
do not require a dedicated pattern anymore, while already implemented
patterns like cmul, cmul_conj, cadd90, ... can be used for the
complex-specific ones.
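To make the distinction concrete, here is a small illustration
written with GCC's generic vector extensions (only a sketch, not the
GIMPLE the passes actually produce): an addition on a complex value
seen as two reals is a plain element-wise operation, whereas a
multiplication mixes the two lanes and therefore keeps a dedicated
opcode until a target pattern implements it.

typedef float v2sf __attribute__ ((vector_size (8)));

/* Complex addition on an SC value seen as {re, im}: a plain
   element-wise add, no complex-specific pattern needed.  */
static v2sf
cadd (v2sf a, v2sf b)
{
  return a + b;
}

/* Complex multiplication mixes the real and imaginary lanes, so it
   stays behind a dedicated opcode (.COMPLEX_MUL) until a target
   pattern like cmul can implement it.  */
static v2sf
cmul (v2sf a, v2sf b)
{
  return (v2sf) { a[0] * b[0] - a[1] * b[1],
                  a[0] * b[1] + a[1] * b[0] };
}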
To do so, the cplxlower pass has been cut into two passes:
- The first one replaces complex-specific operations with dedicated
opcodes (like .COMPLEX_MUL replacing MUL_EXPR with SC mode), but
complex modes are kept at this point. Unsupported native operations
are also lowered here, because we assume that it is better to lower
early and hope for standard optimizations in the middle-end than to
try to vectorize with a near-zero chance of success and only lower
afterwards.
- The second one almost only remaps non-vector complex modes into
vectors of two reals (like SC to V2SF), as illustrated just below.
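The remapping done by the second pass only relies on the fact that a
scalar complex value and a vector of two reals share the same layout.
Roughly (again just an illustration, not what the pass literally
emits):

typedef float v2sf __attribute__ ((vector_size (8)));

/* An SC value {real, imag} occupies the same 8 bytes as a V2SF, so
   remapping SC to V2SF keeps values bit-identical and lets the rest
   of the middle-end and the backends only deal with real vector
   modes.  */
union sc_v2sf
{
  _Complex float sc; /* SC mode */
  v2sf v;            /* V2SF mode: v[0] = real, v[1] = imaginary */
};

static v2sf
sc_to_v2sf (_Complex float x)
{
  union sc_v2sf u = { .sc = x };
  return u.v;
}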
So the vectorizer takes complex modes as input but vectorizes with
vectors of real modes (e.g. a V4SF vector mode for SC). Because the
complex-specific opcodes have been set beforehand, no confusion with
real operations is possible. Vectors of two reals may also be used
as inputs, but vectorizing small vector modes into bigger ones (like
V2SF to V4SF) is not possible.
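To make the V4SF case concrete, here is roughly what a vectorized
.COMPLEX_MUL computes on two complex floats packed into one V4SF
(only a sketch of the semantics with GCC's vector extensions, not
the code the vectorizer emits). On aarch64 the cmul pattern
implements this with a pair of fcmla instructions, which is what
shows up in the example further down.

typedef float v4sf __attribute__ ((vector_size (16)));

/* Two complex floats packed in one V4SF, interleaved as
   {re0, im0, re1, im1}.  A vectorized .COMPLEX_MUL multiplies the
   pairs independently.  */
static v4sf
complex_mul_v4sf (v4sf a, v4sf b)
{
  v4sf r;
  for (int i = 0; i < 4; i += 2)
    {
      r[i]     = a[i] * b[i]     - a[i + 1] * b[i + 1]; /* real */
      r[i + 1] = a[i] * b[i + 1] + a[i + 1] * b[i];     /* imaginary */
    }
  return r;
}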
Here are some advantages of this new approach:
- No more vectors of complex modes
- The vectorization of complex operations is improved, because split
and unified vectorized statements can easily be mixed, as they use
the same vector type. We can also imagine trying multiple options
(first: native vectorized, second: split vectorized, third: unified
scalar, ...).
- It reuses the patterns for vectors of two reals for operations that
are not complex-specific, as well as already existing complex
patterns like cmul implemented on aarch64, which could mean almost
free performance gains on many targets.
On the performance side, we can still exploit the full potential of
the complex instructions on KVX. To illustrate the gains on aarch64
without rewriting any patterns (except a mov), here is the assembly
generated for a vectorized complex mul-mul-add with -O2
-mcpu=neoverse-v1 (and without -ffast-math, unlike with SLP):
#define N 32 /* assumed value; matches the 256-byte loop bound below */

void vfmma (_Complex float a[restrict N], _Complex float b[restrict N],
            _Complex float c[restrict N], _Complex float d[restrict N])
{
  for (int i = 0; i < N; i++)
    c[i] += a[i] * b[i] * d[i];
}
vfmma:
        movi    v3.4s, 0
        mov     x4, 0
        .align  5
.L2:
        ldr     q2, [x1, x4]
        mov     v1.16b, v3.16b
        ldr     q0, [x0, x4]
        fcmla   v1.4s, v0.4s, v2.4s, #0
        fcmla   v1.4s, v0.4s, v2.4s, #90
        ldr     q0, [x2, x4]
        ldr     q2, [x3, x4]
        fcmla   v0.4s, v2.4s, v1.4s, #0
        fcmla   v0.4s, v2.4s, v1.4s, #90
        str     q0, [x2, x4]
        add     x4, x4, 16
        cmp     x4, 256
        bne     .L2
        ret
We have only done some experimentation with this approach so far. If
you think it could be interesting, we will try to develop it further.
Thanks,
Sylvain