Issue 172807
Summary Improve vectorization of complex multiplication using shuffle-based ADDSUB pattern
Labels new issue
Assignees
Reporter ParkHanbum
    
I would like to propose an optimization for complex number multiplication patterns. Currently, the vectorizer (or scalar optimizer) often leaves complex multiplication as a sequence of extractelement, scalar arithmetic, and insertelement.

I suggest transforming this pattern into a vector-friendly form using splats and shuffles, which maps efficiently to hardware instructions like x86 vaddsubpd.

Current Behavior (Scalar/Naive)
The mathematical representation of complex multiplication (a0 + a1*i) * (b0 + b1*i) is currently represented as:

```
Real: a0*b0 - a1*b1
Imag: a0*b1 + a1*b0
Result: { Real, Imag }
```
  
In LLVM IR, this often results in a chain of scalar operations requiring costly element extraction and insertion.

Proposed Behavior (Vectorized)
We can vectorize this operation by preparing two intermediate vectors (sa and sb) and combining them using a subtraction/addition mix.

Given input vectors a = {a0, a1} and b = {b0, b1}:

    Generate sa (Real Splat * Original):

        Splat a0 (Real part of A) -> {a0, a0}

        Multiply by b -> {b0, b1}

        sa = {a0*b0, a0*b1}

    Generate sb (Imag Splat * Swapped):

        Splat a1 (Imag part of A) -> {a1, a1}

        Swap b -> {b1, b0}

 Multiply them

        sb = {a1*b1, a1*b0}

    Combine (ADDSUB semantics):

        Perform subtraction on the even index (Real) and addition on the odd index (Imag).

        Result = { sa[0] - sb[0], sa[1] + sb[1] }

Summary of Logic
```
Previous:
  { a0*b0 - a1*b1 , a0*b1 + a1*b0 }

Suggest:
  sa = {a0*b0, a0*b1}
  sb = {a1*b1, a1*b0}
  Result = { sa[0] - sb[0], sa[1] + sb[1] }
```

example: 
https://rust.godbolt.org/z/PTo8T1EjG


extra:
performance analyse

Suggest : 
```
Iterations:    100
Instructions:   700
Total Cycles:   209
Total uOps:    800

Dispatch Width:  6
uOps Per Cycle: 3.83
IPC:        3.35
Block RThroughput: 1.5
```

before LLVM13:
```
Iterations:    100
Instructions:   700
Total Cycles: 804
Total uOps:    800

Dispatch Width:  6
uOps Per Cycle:  1.00
IPC: 0.87
Block RThroughput: 1.5
```

associate : [#58139](https://github.com/llvm/llvm-project/issues/58139)

_______________________________________________
llvm-bugs mailing list
[email protected]
https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-bugs

Reply via email to