Hello Toon,

the implementation is not finished; we have only run a few tests so far.

If no one sees huge problems with this new approach, we will continue to implement and stabilize it.

Thank you for your interest!

Sylvain

On 10/17/23 22:37, Toon Moene wrote:
Sylvain,

Is this on a branch in your GitHub repository

    https://github.com/kalray/gcc

somewhere?

That would make it easier to test it for me (and probably others).

See for instance my mail here (d.d. Thu Oct 5 14:45:05 GMT 2023):

https://gcc.gnu.org/pipermail/gcc/2023-October/242643.html

Thanks in advance.

Kind regards,

Toon Moene.

On 10/16/23 11:14, Sylvain Noiry via Gcc wrote:

Hi,

We are trying to update our patches on complex numbers to take into account what has been discussed.

The main change from our previous patches is that, when vectorizing, vectors of complex types are replaced with classical vectors of real types (e.g. V4SF instead of V2SC) associated with existing complex opcodes (like .COMPLEX_MUL).  Non-vectorized complex modes are also replaced by vectors of two reals at the end of the middle-end (e.g. SC to V2SF), so that already existing patterns can be reused.  Indeed, operations that are not complex-specific, like an addition, no longer require a specific pattern, and already implemented patterns like cmul, cmul_conj, cadd90,... can be used.
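
To make the SC to V2SF remapping concrete, here is a minimal sketch in plain C. It is not taken from the patches: the type v2sf_t and the helper names are hypothetical stand-ins for a two-float vector mode. It only illustrates why an addition needs no complex-specific pattern once the value is seen as two reals, while a multiplication still needs a dedicated operation.

```c
#include <assert.h>
#include <complex.h>
#include <string.h>

/* Hypothetical stand-in for a V2SF value: lane[0] = real, lane[1] = imag. */
typedef struct { float lane[2]; } v2sf_t;

/* C99 guarantees that float _Complex has the same layout as float[2]. */
static_assert(sizeof(float _Complex) == sizeof(v2sf_t),
              "complex float must be two floats");

/* Reinterpret an SC value as a vector of two reals (no conversion). */
static v2sf_t sc_to_v2sf(float _Complex z)
{
    v2sf_t v;
    memcpy(v.lane, &z, sizeof v.lane);
    return v;
}

/* Reinterpret a vector of two reals as an SC value. */
static float _Complex v2sf_to_sc(v2sf_t v)
{
    float _Complex z;
    memcpy(&z, v.lane, sizeof z);
    return z;
}

/* Complex addition is a plain element-wise addition of the two lanes,
   so an ordinary V2SF add pattern is enough. */
static v2sf_t v2sf_add(v2sf_t a, v2sf_t b)
{
    v2sf_t r = { { a.lane[0] + b.lane[0], a.lane[1] + b.lane[1] } };
    return r;
}

/* Complex multiplication mixes the lanes, so it cannot be expressed as an
   element-wise multiply and needs a dedicated (cmul-like) operation. */
static v2sf_t v2sf_complex_mul(v2sf_t a, v2sf_t b)
{
    v2sf_t r = { { a.lane[0] * b.lane[0] - a.lane[1] * b.lane[1],
                   a.lane[0] * b.lane[1] + a.lane[1] * b.lane[0] } };
    return r;
}
```

For example, with a = 1+2i and b = 3-i, v2sf_add gives the lanes {4, 1} and v2sf_complex_mul gives {5, 5}.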

To do so, the cplxlower pass has been cut into two passes:
   - The first one replaces complex-specific operations with dedicated opcodes (like .COMPLEX_MUL replacing MUL_EXPR with SC mode), but complex modes are kept at this point.  Operations that are not natively supported are also lowered, because we assume that it is better to lower them and hope for standard optimizations in the middle-end than to try to vectorize with a near-zero chance of success and lower only afterwards.
   - The second one does little more than remap non-vectorized complex modes into vectors of two reals (like SC to V2SF).

So the vectorizer takes complex modes as input but vectorizes with vectors of real modes (e.g. V4SF vector mode for SC).  Because complex-specific opcodes have been set beforehand, no confusion with real operations is possible.  We could also use vectors of two reals as inputs, but vectorizing small vector modes into bigger ones (like V2SF to V4SF) is not possible.
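
As an illustration only, the following C sketch models what this means for a loop over SC values: each V4SF vector holds two complex elements stored as {re0, im0, re1, im1}. The function names are hypothetical and plain float arrays stand in for vector registers. The same four lanes give different results under a real multiply and under a complex multiply, which is why the operation has to carry a complex-specific opcode before vectorization.

```c
#include <assert.h>
#include <stddef.h>

enum { V4SF_LANES = 4 };   /* one V4SF holds two complex floats */

/* Ordinary real multiply on four lanes (what a MUL_EXPR on V4SF means). */
static void v4sf_mul(float *dst, const float *x, const float *y)
{
    for (size_t i = 0; i < V4SF_LANES; i++)
        dst[i] = x[i] * y[i];
}

/* Hypothetical model of a complex multiply on the same four lanes:
   lanes are treated as (re, im) pairs.  dst must not alias x or y. */
static void v4sf_complex_mul(float *dst, const float *x, const float *y)
{
    for (size_t i = 0; i < V4SF_LANES; i += 2) {
        dst[i]     = x[i] * y[i]     - x[i + 1] * y[i + 1];
        dst[i + 1] = x[i] * y[i + 1] + x[i + 1] * y[i];
    }
}

/* Vectorized form of "for (i = 0; i < n; i++) c[i] = a[i] * b[i];" where
   a, b and c each hold n complex floats stored as 2 * n interleaved floats.
   This sketch has no scalar epilogue, so n must be even. */
static void complex_mul_loop(float *c, const float *a, const float *b,
                             size_t n_complex)
{
    assert(n_complex % 2 == 0);
    for (size_t i = 0; i < 2 * n_complex; i += V4SF_LANES)
        v4sf_complex_mul(c + i, a + i, b + i);
}
```

With a = {1+2i, i} and b = {3-i, 2}, the complex loop produces {5+5i, 2i}, whereas v4sf_mul on the same lanes produces {3, -2, 0, 0}.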

Here are some advantages of this new approach:
   - No more vectors of complex modes
   - The vectorization of complex operations is improved, because split and unified vectorized statements can easily be mixed, as they use the same vector type.  We can also imagine testing multiple options (first: native vectorized, second: split vectorized, third: unified scalar,...).
   - It reuses patterns for vectors of two reals for operations that are not complex-specific, and also already existing complex patterns like cmul implemented on aarch64, which could mean almost free performance gains on many targets.

On the performance side, we can still exploit the full potential of complex instructions on KVX.  To illustrate the gains on aarch64 without rewriting any patterns (except a mov), here is the assembly generated for a vectorized complex multiply-multiply-add with -O2 -mcpu=neoverse-v1 (and without -ffast-math like with SLP):

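#define N 32  /* not shown in the original mail; inferred from the 256-byte loop bound in the assembly */

void vfmma (_Complex float a[restrict N], _Complex float b[restrict N],
            _Complex float c[restrict N], _Complex float d[restrict N])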
{
   for (int i = 0; i < N; i++)
     c[i] += a[i] * b[i] * d[i];
}


vfmma:
         movi    v3.4s, 0
         mov     x4, 0
         .align  5
.L2:
         ldr     q2, [x1, x4]
         mov     v1.16b, v3.16b
         ldr     q0, [x0, x4]
         fcmla   v1.4s, v0.4s, v2.4s, #0
         fcmla   v1.4s, v0.4s, v2.4s, #90
         ldr     q0, [x2, x4]
         ldr     q2, [x3, x4]
         fcmla   v0.4s, v2.4s, v1.4s, #0
         fcmla   v0.4s, v2.4s, v1.4s, #90
         str     q0, [x2, x4]
         add     x4, x4, 16
         cmp     x4, 256
         bne     .L2
         ret
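
To help read the loop body, here is a small C model of the two FCMLA rotations used above. It is a sketch, not compiler output: fcmla_4s and vfmma_step are hypothetical names, float arrays stand in for the .4s registers, and the model uses separate multiplies and adds where the real instruction fuses them, so rounding can differ. It shows that a #0 / #90 pair accumulates a full complex product, so four FCMLAs implement c += a * b * d for two complex elements.

```c
#include <assert.h>
#include <stddef.h>

/* Model of "fcmla acc.4s, n.4s, m.4s, #rot" for rot = 0 or 90.
   Lanes are {re0, im0, re1, im1}. */
static void fcmla_4s(float acc[4], const float n[4], const float m[4], int rot)
{
    assert(rot == 0 || rot == 90);   /* only the rotations used in the loop */
    for (size_t i = 0; i < 4; i += 2) {
        if (rot == 0) {
            acc[i]     += n[i] * m[i];           /* re += n.re * m.re */
            acc[i + 1] += n[i] * m[i + 1];       /* im += n.re * m.im */
        } else {
            acc[i]     -= n[i + 1] * m[i + 1];   /* re -= n.im * m.im */
            acc[i + 1] += n[i + 1] * m[i];       /* im += n.im * m.re */
        }
    }
}

/* One iteration of the loop at .L2: two complex elements of each array. */
static void vfmma_step(const float a[4], const float b[4],
                       float c[4], const float d[4])
{
    float t[4] = { 0.0f, 0.0f, 0.0f, 0.0f };   /* mov v1.16b, v3.16b */
    fcmla_4s(t, a, b, 0);                      /* t  = a * b (two steps) */
    fcmla_4s(t, a, b, 90);
    fcmla_4s(c, d, t, 0);                      /* c += d * t (two steps) */
    fcmla_4s(c, d, t, 90);
}
```

For a = {1+2i, i}, b = {3-i, 2}, d = {1+i, 1} and c = {10+20i, 0}, one step leaves c = {10+30i, 2i}.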

So far we have only experimented with this approach.  If you think it could be interesting, we will try to develop it further.

Thanks,

Sylvain









