Hi, I have recently started a discussion about exposing complex operations directly to the backends, to better exploit ISA with complex instructions. The title of the original message is "[RFC] Exposing complex numbers to target backends" [1].
This message starts a series of 9 patches of the implementation that I've done. 8 patches are about generic code, split by features. The last one is an experimental update of the x86 backend which exploits the newly exposed complex operations. My original work was on the KVX backend from Kalray, whose ISA has complex instructions, so I obtained huge performance gains, on par with code that uses builtins. On x86 there are gains without -ffast-math because fewer calls to helpers are performed, but gains are marginal with -ffast-math due to the lack of complex instructions.

[1] https://gcc.gnu.org/pipermail/gcc/2023-July/241998.html

Summary of the 9 patches:
 1/9: Conditional lowering of complex operations using the backend + update
      on the TREE complex constants
 2/9: Move read_complex_part and write_complex_part to target hooks to let
      the backend decide
 3/9: Add a gen_rtx_complex target hook to let the backend use its preferred
      representation of complex numbers in rtl
 4/9: Support and optimize the use of classical registers to represent
      complex numbers
 5/9: Expose the conjugate operation down to the backend
 6/9: Expose and optimize complex rotations using internal functions and
      conditional lowering
 7/9: Allow the vectorizer to work on complex types like it does on scalars
 8/9: Add explicit vectors of complex numbers. This remains optional
 9/9: Experimental update to the x86 backend to exploit some of the previous
      features

The following sections explain the features added by each patch and illustrate them with examples on KVX, because this backend supports all the new features. All examples are compiled with -O2 -ffast-math. Patches 1 to 4 are required to have the minimal set of features which allows a backend to exploit native complex operations.

PATCH 1/9:
 - Change the TREE complex constants by adding a new field called "both" in
   the tree_complex struct, which holds a vector of the real and imaginary
   parts.
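As a rough sketch of the idea (this is not the actual GCC data structure, just an illustration with hypothetical names), the constant carries its parts both individually and as a two-element vector, and every update keeps the two views in sync:

```c
#include <assert.h>

/* Hypothetical illustration of the "both" field idea: the real and
   imaginary parts are stored individually and mirrored in a two-element
   array, so a pass can fetch both parts at once.  */
struct complex_cst {
  double real;
  double imag;
  double both[2];   /* both[0] mirrors real, both[1] mirrors imag */
};

static void set_real (struct complex_cst *c, double r)
{
  c->real = r;
  c->both[0] = r;   /* keep the vector view consistent */
}

static void set_imag (struct complex_cst *c, double i)
{
  c->imag = i;
  c->both[1] = i;
}
```

With this invariant, code that reads the pair never has to reassemble it from the two scalar fields.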
   This makes the handling of constants during the cplxlower and expand
   passes easier. Any change to one part will also affect the vector, so
   very few changes are needed elsewhere.
 - Check the optab for a complex pattern for almost all operations in the
   cplxlower pass. The lowering is done only if an optab code is found.
   Some conditions on the presence of constants in the operands were also
   added, which can be a subject of discussion.
 - Add a complex component for both parts in the cplxlower pass. When an
   operation is lowered, the both part is recreated using a COMPLEX_EXPR.
   When an operation is kept non-lowered, real and imaginary parts are
   extracted using REALPART_EXPR and IMAGPART_EXPR.

PATCH 2/9:
 - Move the inner implementation of read_complex_part and write_complex_part
   to target hooks. This allows each backend to have its own implementation,
   while the default ones are almost the same as before. Going back to
   standard functions may be a point to discuss if no incompatible changes
   are made by the target to the default implementation.
 - Change the signature of read_complex_part and write_complex_part to allow
   requesting both parts as a part. This affects all the calls to these
   functions.

PATCH 3/9:
 - Add a new target hook to replace gen_rtx_CONCAT when a complex element
   needs to be created. The default implementation uses gen_rtx_CONCAT, but
   the KVX implementation simply creates a register with a complex type. A
   previous attempt was to deal with generating_concat_p in gen_rtx_reg, but
   no good solution was found.

PATCH 4/9:
 - Adapt and optimize for the use of native complex operations in rtl, as
   well as registers of complex types.

After this patch, it's now possible to re-implement the three new hooks and write some complex patterns.
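For reference, when no complex pattern is found and the operation is lowered, a complex multiplication is split into scalar operations equivalent to the following C (with -ffast-math, so no overflow-handling helper call; the function name is mine, for illustration):

```c
#include <assert.h>
#include <complex.h>

/* Hand-written equivalent of the scalar lowering of a * b for
   _Complex float: four multiplies, one subtract, one add.  */
static _Complex float
mul_lowered (_Complex float a, _Complex float b)
{
  float ar = crealf (a), ai = cimagf (a);
  float br = crealf (b), bi = cimagf (b);
  /* (ar + ai*I) * (br + bi*I) = (ar*br - ai*bi) + (ar*bi + ai*br)*I */
  return (ar * br - ai * bi) + (ar * bi + ai * br) * I;
}
```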
Considering the following example:

_Complex float mul(_Complex float a, _Complex float b)
{
  return a * b;
}

Previously, the generated code was:

mul:
        copyw $r3 = $r0
        extfz $r5 = $r0, 32+32-1, 32  ; extract imag part
        ;;                            # (end cycle 0)
        fmulw $r4 = $r3, $r1          ; float mul
        ;;                            # (end cycle 1)
        fmulw $r2 = $r5, $r1          ; float mul
        extfz $r1 = $r1, 32+32-1, 32  ; extract imag part
        ;;                            # (end cycle 2)
        ffmsw $r4 = $r5, $r1          ; float FMS
        ;;                            # (end cycle 5)
        ffmaw $r2 = $r3, $r1          ; float FMA
        ;;                            # (end cycle 6)
        insf $r0 = $r4, 32+0-1, 0     ; insert real part
        ;;                            # (end cycle 9)
        insf $r0 = $r2, 32+32-1, 32   ; insert imag part
        ret
        ;;                            # (end cycle 10)

The KVX has a complex float multiplication instruction, so now the result is:

mul:
        fmulwc $r0 = $r0, $r1         ; float complex mul
        ret
        ;;                            # (end cycle 0)

Similar results are obtained for additions, subtractions, and negations. Moreover, if complex FMA and FMS patterns are defined, these operations will be caught exactly like their scalar counterparts. Indeed, most passes of GCC work out of the box for complex types.

PATCH 5/9:
 - Expose the complex conjugate operation to the backend. This is done
   through a new conj optab and a new conj rtl operation. CONJ_EXPR can now
   be kept and expanded if a conj pattern exists.

Considering the following example:

_Complex float conjugate(_Complex float a)
{
  return ~a;
}

Previously, the generated code was:

conjugate:
        extfz $r1 = $r0, 32+32-1, 32  ; extract imag part
        copyw $r2 = $r0
        ;;                            # (end cycle 0)
        fnegw $r1 = $r1               ; float negate
        insf $r0 = $r2, 32+0-1, 0     ; insert real part
        ;;                            # (end cycle 1)
        insf $r0 = $r1, 32+32-1, 32   ; insert imag part
        ret
        ;;                            # (end cycle 2)

Now, considering that the imag part is in the upper part of a 64-bit register, the generated code is:

conjugate:
        fnegd $r0 = $r0               ; double negate
        ret
        ;;                            # (end cycle 0)

The KVX can also conjugate one operand before doing an operation. The backend programmer only has to write the pattern, and the combiner pass is now able to catch it.
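As an aside, the single fnegd works because, with the imaginary float in the upper 32 bits of the register, its sign bit coincides with the sign bit of a 64-bit float, so conjugation is a single bit flip. A sketch of that equivalence in portable C (memcpy-based type punning; a little-endian layout is assumed, and the function name is mine):

```c
#include <assert.h>
#include <complex.h>
#include <stdint.h>
#include <string.h>

/* Conjugate a _Complex float by flipping bit 63 of its 64-bit image,
   mimicking what a 64-bit float negate does on a register whose upper
   32 bits hold the imaginary part (little-endian layout assumed).  */
static _Complex float
conj_bitflip (_Complex float a)
{
  uint64_t bits;
  memcpy (&bits, &a, sizeof bits);
  bits ^= UINT64_C (1) << 63;   /* sign bit of the imaginary part */
  memcpy (&a, &bits, sizeof a);
  return a;
}
```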
For the following example:

_Complex float mulc(_Complex float a, _Complex float b)
{
  return a * ~b;
}

the generated code is:

mulc:
        fmulwc.c $r0 = $r1, $r0       ; complex mul with conj
        ret
        ;;                            # (end cycle 0)

PATCH 6/9:
 - Catch complex rotations by 90° and 270° in fold-const.cc like before, but
   now convert them into the new COMPLEX_ROT90 and COMPLEX_ROT270 internal
   functions
 - Add crot90 and crot270 optabs to expose these operations to the backends
 - Conditionally lower COMPLEX_ROT90/COMPLEX_ROT270 by checking if
   crot90/crot270 are in the optab
 - Convert a + crot90/270(b) into cadd90/270(a, b) in a similar way to FMAs

This approach differs from the one implemented a few years ago by Tamar Christina. The main advantage is that it does not try to recognize a rotation among the lowered operations, so there are fewer misses.

Considering the following example:

_Complex float crot90(_Complex float a)
{
  return a * I;
}

Previously the generated code was:

rot90:
        extfz $r1 = $r0, 32+32-1, 32  ; extract imag part
        copyw $r2 = $r0
        ;;                            # (end cycle 0)
        fnegw $r1 = $r1               ; float negate
        ;;                            # (end cycle 1)
        insf $r0 = $r1, 32+0-1, 0     ; insert real part
        ;;                            # (end cycle 5)
        insf $r0 = $r2, 32+32-1, 32   ; insert imag part
        ret
        ;;                            # (end cycle 6)

Even if the KVX does not have a single instruction for the complex rotation, a define_expand pattern can already bring gains:

crot90:
        fnegd $r0 = $r0               ; double negate
        ;;                            # (end cycle 0)
        sbmm8 $r0 = $r0, 0x0804020180402010  ; swap real imag
        ret
        ;;                            # (end cycle 1)

PATCH 7/9:
 - Add vectors of complex inner types
 - Adapt or duplicate several functions and target hooks to deal with
   vectors of complex numbers and not just scalars

After these changes the vectorizer works out of the box for native complex operations.
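As a minimal illustration (hypothetical function, any array length), even a plain elementwise loop over complex operands is now a vectorization candidate when the backend provides vector complex patterns:

```c
#include <assert.h>
#include <complex.h>

#define N 32

/* Elementwise complex multiply over arrays: the kind of loop the
   vectorizer can now handle directly on complex types.  */
static void
cmul_arrays (const _Complex float a[N], const _Complex float b[N],
             _Complex float c[N])
{
  for (int i = 0; i < N; i++)
    c[i] = a[i] * b[i];
}
```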
Considering the following example:

void fmaconjcplx (float complex a[restrict N], float complex b[restrict N],
                  float complex c[restrict N], float complex d[restrict N])
{
  for (int i = 0; i < N; i++)
    d[i] = c[i] + a[i] * b[i];
}

The vectorizer has done its job and the KVX has some SIMD complex instructions, so the result is:

fmacplx:
        make $r4 = 0
        make $r5 = 32
        ;;                            # (end cycle 0)
        loopdo $r5, .L56              ; begin hardware loop
        ;;                            # (end cycle 1)
.L53:
        lq.xs $r10r11 = $r4[$r1]      ; load 128-bit
        ;;                            # (end cycle 0)
        lq.xs $r8r9 = $r4[$r0]        ; load 128-bit
        ;;                            # (end cycle 1)
        lq.xs $r6r7 = $r4[$r2]        ; load 128-bit
        ;;                            # (end cycle 2)
        ffmawcp $r6r7 = $r10r11, $r8r9  ; float complex FMA two lanes
        ;;                            # (end cycle 5)
        sq.xs $r4[$r3] = $r6r7        ; store 128-bit
        addd $r4 = $r4, 1
        ;;                            # (end cycle 8)
        # loopdo end
.L56:
        ret

PATCH 8/9:
 - Allow the creation of explicit complex vectors in C using
   __attribute__ ((vector_size ()))

This patch adds a feature to C, so it's optional and has more implications than just GCC.

PATCH 9/9:
 - Experimental support of complex operations in the x86 backend
 - Only in SCmode for now
 - Add (scalar) addition, subtraction, negation, conjugate, move, and
   multiplication

Small examples can show interesting gains even though there is no complex instruction on x86, because the patterns can be implemented using multiple instructions in a define_expand. Considering the same complex multiplication as the first example, compiled with -mavx, the generated code is:

mul:
        vshufps $20, %xmm1, %xmm1, %xmm1
        vshufps $68, %xmm0, %xmm0, %xmm0
        vmulps  %xmm1, %xmm0, %xmm0
        vshufps $13, %xmm0, %xmm0, %xmm1
        vshufps $8, %xmm0, %xmm0, %xmm0
        vaddsubps %xmm1, %xmm0, %xmm0
        ret

However, complete tests like FFTs only show marginal gains; huge gains can nevertheless be obtained without -ffast-math. But it is still experimental and I am not an expert on the x86 backend, so improvements are certainly possible.

Thanks,
Sylvain