Hi, I have recently started a discussion about exposing complex operations directly to the backends, to better exploit ISA with complex instructions. The title of the original message is "[RFC] Exposing complex numbers to target backends" [1].
This message starts a series of 9 patches of the implementation that I've done. 8 patches are about generic code, split by features. The last one is an experimental update of the x86 backend which exploits the newly exposed complex operations. My original work was on the KVX backend from Kalray, whose ISA has complex instructions, so I obtained huge performance gains, on par with code that uses builtins. On x86 there are gains without -ffast-math because fewer calls to helpers are performed, but gains are marginal with -ffast-math due to the lack of complex instructions.

[1] https://gcc.gnu.org/pipermail/gcc/2023-July/241998.html

Summary of the 9 patches:
 1/9: Conditional lowering of complex operations using the backend + update
      on the TREE complex constants
 2/9: Move read_complex_part and write_complex_part to target hooks to let
      the backend decide
 3/9: Add a gen_rtx_complex target hook to let the backend use its preferred
      representation of complex numbers in rtl
 4/9: Support and optimize the use of classical registers to represent
      complex numbers
 5/9: Expose the conjugate operation down to the backend
 6/9: Expose and optimize complex rotations using internal functions and
      conditional lowering
 7/9: Allow the vectorizer to work on complex types like it does on scalars
 8/9: Add explicit vectors of complex numbers. This remains optional
 9/9: Experimental update to the x86 backend to exploit some of the previous
      features

The following sections explain the features added by each patch and illustrate them with examples on KVX, because this backend supports all the new features. All examples are compiled with -O2 -ffast-math. Patches 1 to 4 are required to have the minimal set of features which allows a backend to exploit native complex operations.

PATCH 1/9:
 - Change the TREE complex constants by adding a new field called "both" in
   the tree_complex struct, which holds a vector of the real and imaginary
   parts.
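As a rough sketch of the idea (this is not the actual GCC data structure, just an illustration with hypothetical names), the constant carries its parts both individually and as a two-element vector, and every update keeps the two views in sync:

```c
#include <assert.h>

/* Hypothetical illustration of the "both" field idea: the real and
   imaginary parts are stored individually and mirrored in a two-element
   array, so a pass can fetch both parts at once.  */
struct complex_cst {
  double real;
  double imag;
  double both[2];   /* both[0] mirrors real, both[1] mirrors imag */
};

static void set_real (struct complex_cst *c, double r)
{
  c->real = r;
  c->both[0] = r;   /* keep the vector view consistent */
}

static void set_imag (struct complex_cst *c, double i)
{
  c->imag = i;
  c->both[1] = i;
}
```

With this invariant, code that reads the pair never has to reassemble it from the two scalar fields.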
   This makes the handling of constants during the cplxlower and expand
   passes easier. Any change to one part will also affect the vector, so
   very few changes are needed elsewhere.
 - Check the optab for a complex pattern for almost all operations in the
   cplxlower pass. The lowering is done only if an optab code is found.
   Some conditions on the presence of constants in the operands were also
   added, which can be a subject of discussion.
 - Add a complex component for both parts in the cplxlower pass. When an
   operation is lowered, the both part is recreated using a COMPLEX_EXPR.
   When an operation is kept non-lowered, real and imaginary parts are
   extracted using REALPART_EXPR and IMAGPART_EXPR.

PATCH 2/9:
 - Move the inner implementation of read_complex_part and write_complex_part
   to target hooks. This allows each backend to have its own implementation,
   while the default ones are almost the same as before. Going back to
   standard functions may be a point to discuss if no incompatible changes
   are made by the target to the default implementation.
 - Change the signature of read_complex_part and write_complex_part to allow
   requesting both parts as a part. This affects all the calls to these
   functions.

PATCH 3/9:
 - Add a new target hook to replace gen_rtx_CONCAT when a complex element
   needs to be created. The default implementation uses gen_rtx_CONCAT, but
   the KVX implementation simply creates a register with a complex type. A
   previous attempt was to deal with generating_concat_p in gen_rtx_reg, but
   no good solution was found.

PATCH 4/9:
 - Adapt and optimize for the use of native complex operations in rtl, as
   well as registers of complex types.

After this patch, it's now possible to re-implement the three new hooks and write some complex patterns.
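For reference, when no complex pattern is found and the operation is lowered, a complex multiplication is split into scalar operations equivalent to the following C (with -ffast-math, so no overflow-handling helper call; the function name is mine, for illustration):

```c
#include <assert.h>
#include <complex.h>

/* Hand-written equivalent of the scalar lowering of a * b for
   _Complex float: four multiplies, one subtract, one add.  */
static _Complex float
mul_lowered (_Complex float a, _Complex float b)
{
  float ar = crealf (a), ai = cimagf (a);
  float br = crealf (b), bi = cimagf (b);
  /* (ar + ai*I) * (br + bi*I) = (ar*br - ai*bi) + (ar*bi + ai*br)*I */
  return (ar * br - ai * bi) + (ar * bi + ai * br) * I;
}
```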
Considering the following example:

_Complex float mul(_Complex float a, _Complex float b)
{
  return a * b;
}

Previously, the generated code was:

mul:
        copyw $r3 = $r0
        extfz $r5 = $r0, 32+32-1, 32  ; extract imag part
        ;;                            # (end cycle 0)
        fmulw $r4 = $r3, $r1          ; float mul
        ;;                            # (end cycle 1)
        fmulw $r2 = $r5, $r1          ; float mul
        extfz $r1 = $r1, 32+32-1, 32  ; extract imag part
        ;;                            # (end cycle 2)
        ffmsw $r4 = $r5, $r1          ; float FMS
        ;;                            # (end cycle 5)
        ffmaw $r2 = $r3, $r1          ; float FMA
        ;;                            # (end cycle 6)
        insf $r0 = $r4, 32+0-1, 0     ; insert real part
        ;;                            # (end cycle 9)
        insf $r0 = $r2, 32+32-1, 32   ; insert imag part
        ret
        ;;                            # (end cycle 10)

The KVX has a complex float multiplication instruction, so now the result is:

mul:
        fmulwc $r0 = $r0, $r1         ; float complex mul
        ret
        ;;                            # (end cycle 0)

Similar results are obtained for additions, subtractions, and negations. Moreover, if complex FMA and FMS patterns are defined, these operations will be caught exactly like their scalar counterparts. Indeed, most passes of GCC work out of the box for complex types.

PATCH 5/9:
 - Expose the complex conjugate operation to the backend. This is done
   through a new conj optab and a new conj rtl operation. CONJ_EXPR can now
   be kept and expanded if a conj pattern exists.

Considering the following example:

_Complex float conjugate(_Complex float a)
{
  return ~a;
}

Previously, the generated code was:

conjugate:
        extfz $r1 = $r0, 32+32-1, 32  ; extract imag part
        copyw $r2 = $r0
        ;;                            # (end cycle 0)
        fnegw $r1 = $r1               ; float negate
        insf $r0 = $r2, 32+0-1, 0     ; insert real part
        ;;                            # (end cycle 1)
        insf $r0 = $r1, 32+32-1, 32   ; insert imag part
        ret
        ;;                            # (end cycle 2)

Now, considering that the imag part is in the upper part of a 64-bit register, the generated code is:

conjugate:
        fnegd $r0 = $r0               ; double negate
        ret
        ;;                            # (end cycle 0)

The KVX can also conjugate one operand before doing an operation. The backend programmer only has to write the pattern, and the combiner pass is now able to catch it.
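As an aside, the single fnegd works because, with the imaginary float in the upper 32 bits of the register, its sign bit coincides with the sign bit of a 64-bit float, so conjugation is a single bit flip. A sketch of that equivalence in portable C (memcpy-based type punning; a little-endian layout is assumed, and the function name is mine):

```c
#include <assert.h>
#include <complex.h>
#include <stdint.h>
#include <string.h>

/* Conjugate a _Complex float by flipping bit 63 of its 64-bit image,
   mimicking what a 64-bit float negate does on a register whose upper
   32 bits hold the imaginary part (little-endian layout assumed).  */
static _Complex float
conj_bitflip (_Complex float a)
{
  uint64_t bits;
  memcpy (&bits, &a, sizeof bits);
  bits ^= UINT64_C (1) << 63;   /* sign bit of the imaginary part */
  memcpy (&a, &bits, sizeof a);
  return a;
}
```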
For the following example:

_Complex float mulc(_Complex float a, _Complex float b)
{
  return a * ~b;
}

the generated code is:

mulc:
        fmulwc.c $r0 = $r1, $r0       ; complex mul with conj
        ret
        ;;                            # (end cycle 0)

PATCH 6/9:
 - Catch complex rotations by 90° and 270° in fold-const.cc like before, but
   now convert them into the new COMPLEX_ROT90 and COMPLEX_ROT270 internal
   functions
 - Add crot90 and crot270 optabs to expose these operations to the backends
 - Conditionally lower COMPLEX_ROT90/COMPLEX_ROT270 by checking if
   crot90/crot270 are in the optab
 - Convert a + crot90/270(b) into cadd90/270(a, b) in a similar way to FMAs

This approach differs from the one implemented a few years ago by Tamar Christina. The main advantage is that it does not try to recognize a rotation among the lowered operations, so there are fewer misses.

Considering the following example:

_Complex float crot90(_Complex float a)
{
  return a * I;
}

Previously the generated code was:

rot90:
        extfz $r1 = $r0, 32+32-1, 32  ; extract imag part
        copyw $r2 = $r0
        ;;                            # (end cycle 0)
        fnegw $r1 = $r1               ; float negate
        ;;                            # (end cycle 1)
        insf $r0 = $r1, 32+0-1, 0     ; insert real part
        ;;                            # (end cycle 5)
        insf $r0 = $r2, 32+32-1, 32   ; insert imag part
        ret
        ;;                            # (end cycle 6)

Even if the KVX does not have a single instruction for the complex rotation, a define_expand pattern can already bring gains:

crot90:
        fnegd $r0 = $r0               ; double negate
        ;;                            # (end cycle 0)
        sbmm8 $r0 = $r0, 0x0804020180402010  ; swap real imag
        ret
        ;;                            # (end cycle 1)

PATCH 7/9:
 - Add vectors of complex inner types
 - Adapt or duplicate several functions and target hooks to deal with
   vectors of complex numbers and not just scalars

After these changes the vectorizer works out of the box for native complex operations.
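As a minimal illustration (hypothetical function, any array length), even a plain elementwise loop over complex operands is now a vectorization candidate when the backend provides vector complex patterns:

```c
#include <assert.h>
#include <complex.h>

#define N 32

/* Elementwise complex multiply over arrays: the kind of loop the
   vectorizer can now handle directly on complex types.  */
static void
cmul_arrays (const _Complex float a[N], const _Complex float b[N],
             _Complex float c[N])
{
  for (int i = 0; i < N; i++)
    c[i] = a[i] * b[i];
}
```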
Considering the following example:

void fmaconjcplx (float complex a[restrict N], float complex b[restrict N],
                  float complex c[restrict N], float complex d[restrict N])
{
  for (int i = 0; i < N; i++)
    d[i] = c[i] + a[i] * b[i];
}

The vectorizer has done its job and the KVX has some SIMD complex instructions, so the result is:

fmacplx:
        make $r4 = 0
        make $r5 = 32
        ;;                            # (end cycle 0)
        loopdo $r5, .L56              ; begin hardware loop
        ;;                            # (end cycle 1)
.L53:
        lq.xs $r10r11 = $r4[$r1]      ; load 128-bit
        ;;                            # (end cycle 0)
        lq.xs $r8r9 = $r4[$r0]        ; load 128-bit
        ;;                            # (end cycle 1)
        lq.xs $r6r7 = $r4[$r2]        ; load 128-bit
        ;;                            # (end cycle 2)
        ffmawcp $r6r7 = $r10r11, $r8r9  ; float complex FMA two lanes
        ;;                            # (end cycle 5)
        sq.xs $r4[$r3] = $r6r7        ; store 128-bit
        addd $r4 = $r4, 1
        ;;                            # (end cycle 8)
        # loopdo end
.L56:
        ret

PATCH 8/9:
 - Allow the creation of explicit complex vectors in C using
   __attribute__ ((vector_size ()))

This patch adds a feature to C, so it's optional and has more implications than just GCC.

PATCH 9/9:
 - Experimental support of complex operations in the x86 backend
 - Only in SCmode for now
 - Add (scalar) addition, subtraction, negation, conjugate, move, and
   multiplication

Small examples can show interesting gains even though there is no complex instruction on x86, because the patterns can be implemented using multiple instructions in a define_expand. Considering the same complex multiplication as the first example, compiled with -mavx, the generated code is:

mul:
        vshufps $20, %xmm1, %xmm1, %xmm1
        vshufps $68, %xmm0, %xmm0, %xmm0
        vmulps  %xmm1, %xmm0, %xmm0
        vshufps $13, %xmm0, %xmm0, %xmm1
        vshufps $8, %xmm0, %xmm0, %xmm0
        vaddsubps %xmm1, %xmm0, %xmm0
        ret

However, complete tests like FFTs only show marginal gains; huge gains can nevertheless be obtained without -ffast-math. But it is still experimental and I am not an expert on the x86 backend, so improvements are certainly possible.

Thanks,
Sylvain