Sylvain,
Is this on a branch in your GitHub repository
https://github.com/kalray/gcc
somewhere?
That would make it easier for me (and probably others) to test it.
See for instance my mail here (dated Thu Oct 5 14:45:05 GMT 2023):
https://gcc.gnu.org/pipermail/gcc/2023-October/242643.html
Thanks in advance.
Kind regards,
Toon Moene.
On 10/16/23 11:14, Sylvain Noiry via Gcc wrote:
Hi,
We are trying to update our patches on complex numbers to take into
account what has been discussed.
The main change from our previous patches consists of replacing vectors
of complex types with classical vectors of real types (e.g. V4SF instead
of V2SC) associated with existing complex opcodes (like .COMPLEX_MUL)
when vectorizing. Non-vectorized complex modes are also replaced by
vectors of two reals at the end of the middle-end (e.g. SC to V2SF), so
that existing patterns can be reused. Indeed, operations that are not
complex-specific, like an addition, no longer require a specific
pattern, and already implemented patterns like cmul, cmul_conj,
cadd90, ... can be used.
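As a minimal sketch of why an addition needs no complex-specific pattern: C guarantees that a complex float has the same layout as an array of two floats (real part, then imaginary part), so adding complex values is the same as adding the underlying reals lane by lane. The helper name cadd_as_reals is hypothetical and only illustrates the idea; it is not taken from the patches.

```c
#include <complex.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: add n complex floats by viewing each one as two
   plain floats (re, im). This mirrors treating SC data as V2SF: the
   addition is an ordinary lane-wise real addition. */
static void cadd_as_reals(float _Complex *dst, const float _Complex *x,
                          const float _Complex *y, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        float xr[2], yr[2], sum[2];
        memcpy(xr, &x[i], sizeof xr);   /* xr[0] = real, xr[1] = imaginary */
        memcpy(yr, &y[i], sizeof yr);
        sum[0] = xr[0] + yr[0];         /* real lane */
        sum[1] = xr[1] + yr[1];         /* imaginary lane */
        memcpy(&dst[i], sum, sizeof sum);
    }
}
```

Two consecutive complex floats occupy four floats in memory, which is why a V4SF vector can carry two SC elements.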
To do so, the cplxlower pass has been split into two passes:
- The first one replaces complex-specific operations with dedicated
opcodes (like .COMPLEX_MUL replacing MUL_EXPR with SC mode), but complex
modes are kept at this point. Operations without native support are also
lowered, because we assume that it is better to lower early and hope for
standard optimizations in the middle-end than to attempt vectorization
with a near-zero chance of success and only lower afterwards.
- The second one does little more than remap non-vectorized complex modes
into vectors of two reals (like SC to V2SF).
So the vectorizer takes complex modes as input but vectorizes with
vectors of real modes (e.g. V4SF vector mode for SC). Because the
complex-specific opcodes have been set beforehand, no confusion with real
operations is possible. We could also use vectors of two reals as
inputs, but vectorizing small vector modes into bigger ones (like V2SF
to V4SF) is not possible.
Here are some advantages of this new approach:
- No more vectors of complex modes.
- The vectorization of complex operations is improved, because split
and unified vectorized statements can easily be mixed, as they use the
same vector type. We can also imagine trying multiple options in turn
(first: native vectorized, second: split vectorized, third: unified
scalar, ...).
- It reuses patterns for vectors of two reals for operations that are
not complex-specific, as well as existing complex patterns like
cmul implemented on aarch64, which could mean almost free performance
gains on many targets.
On the performance side, we can still exploit the full potential of
complex instructions on KVX. To illustrate the gains on aarch64 without
rewriting any patterns (except a mov), here is the assembly generated
for a vectorized complex multiply-multiply-add with -O2 -mcpu=neoverse-v1
(and without -ffast-math, unlike with SLP):
void vfmma (_Complex float a[restrict N], _Complex float b[restrict N],
            _Complex float c[restrict N], _Complex float d[restrict N])
{
  for (int i = 0; i < N; i++)
    c[i] += a[i] * b[i] * d[i];
}
vfmma:
        movi    v3.4s, 0
        mov     x4, 0
        .align  5
.L2:
        ldr     q2, [x1, x4]
        mov     v1.16b, v3.16b
        ldr     q0, [x0, x4]
        fcmla   v1.4s, v0.4s, v2.4s, #0
        fcmla   v1.4s, v0.4s, v2.4s, #90
        ldr     q0, [x2, x4]
        ldr     q2, [x3, x4]
        fcmla   v0.4s, v2.4s, v1.4s, #0
        fcmla   v0.4s, v2.4s, v1.4s, #90
        str     q0, [x2, x4]
        add     x4, x4, 16
        cmp     x4, 256
        bne     .L2
        ret
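To help read the listing, here is a rough scalar model of what each pair of fcmla instructions does on one complex lane, following the usual description of the #0 and #90 rotations. The names lane_t, fcmla_rot0, fcmla_rot90 and vfmma_lane are made up for this sketch, and the model uses plain multiplies and adds where the hardware uses fused multiply-adds, so results can differ in the last bit for non-trivial inputs.

```c
/* One complex lane, stored as (re, im) like two adjacent floats of a
   V4SF register. All names here are illustrative only. */
typedef struct { float re, im; } lane_t;

/* Model of fcmla with rotation #0: acc += n.re * m. */
static lane_t fcmla_rot0(lane_t acc, lane_t n, lane_t m)
{
    acc.re += n.re * m.re;
    acc.im += n.re * m.im;
    return acc;
}

/* Model of fcmla with rotation #90: acc += (i * n.im) * m. */
static lane_t fcmla_rot90(lane_t acc, lane_t n, lane_t m)
{
    acc.re -= n.im * m.im;
    acc.im += n.im * m.re;
    return acc;
}

/* One lane of the loop body, c += a * b * d, as four fcmla steps. */
static lane_t vfmma_lane(lane_t a, lane_t b, lane_t c, lane_t d)
{
    lane_t t = {0.0f, 0.0f};    /* zeroed accumulator (movi / mov) */
    t = fcmla_rot0(t, a, b);    /* fcmla v1, v0, v2, #0  */
    t = fcmla_rot90(t, a, b);   /* fcmla v1, v0, v2, #90: t = a * b */
    c = fcmla_rot0(c, d, t);    /* fcmla v0, v2, v1, #0  */
    c = fcmla_rot90(c, d, t);   /* fcmla v0, v2, v1, #90: c += d * t */
    return c;
}
```

For example, with a = 1+2i, b = 3+4i, c = 1+1i and d = 2-1i, the intermediate product is -5+10i and vfmma_lane returns 1+26i, which matches c + a*b*d computed by hand.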
We have only experimented with this approach so far. If you think
it is worthwhile, we will try to develop it further.
Thanks,
Sylvain
--
Toon Moene - e-mail: t...@moene.org - phone: +31 346 214290
Saturnushof 14, 3738 XG Maartensdijk, The Netherlands