| Issue | 97580 |
| Summary | [X86] Vector-Vector dot product not reduced to corresponding single instruction |
| Labels | new issue |
| Assignees | |
| Reporter | Hendiadyoin1 |

Given the following C++ code snippets:
```c++
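// Assumed definition (not shown in the original report): a four-element
// float vector using the GCC/Clang vector extension.
typedef float f32x4 __attribute__((vector_size(16)));

// Full four-element dot product, scalar result.
float simple_dot_product(f32x4 a, f32x4 b) {
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2] + a[3] * b[3];
}

// Full dot product, result broadcast to every lane.
f32x4 dot_product_broadcast(f32x4 a, f32x4 b) {
    float d = a[0] * b[0] + a[1] * b[1] + a[2] * b[2] + a[3] * b[3];
    f32x4 r = {d, d, d, d};
    return r;
}

// Dot product that skips element 1, scalar result.
float selective_dot_product(f32x4 a, f32x4 b) {
    return a[0] * b[0] + a[2] * b[2] + a[3] * b[3];
}

// Dot product that skips element 1, result written to lanes 0, 1 and 3.
f32x4 selective_dot_product_selective_broadcast(f32x4 a, f32x4 b) {
    float d = a[0] * b[0] + a[2] * b[2] + a[3] * b[3];
    f32x4 r = {d, d, 0, d};
    return r;
}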
```
Clang/LLVM fails to reduce each of these to a single `dpps` (dot product of packed singles) instruction when SSE4.2 is enabled (`dpps` itself is available from SSE4.1 onward). The same might be true for the `double` case (`dppd`).
Godbolt link with hopefully correct targets:
https://godbolt.org/z/od5ezWM19
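
To make the expected lowering concrete, here is a minimal scalar sketch of how `dpps` interprets its 8-bit immediate, together with the mask each of the four snippets would presumably need. This is a model, not compiler output: `dpps_model` and the `*_model` wrappers are hypothetical names, the `f32x4` typedef is an assumption (the report does not show its definition), and the pairwise summation order follows a literal reading of the operation pseudocode on the instruction reference page.

```c++
#include <cstdint>

// Assumption: f32x4 is a GCC/Clang vector-extension type; the report does
// not show its definition.
typedef float f32x4 __attribute__((vector_size(16)));

// Hypothetical scalar model of `dpps` with immediate imm8:
//   bits 4..7 select which element products take part in the sum,
//   bits 0..3 select which result lanes receive the sum (others get 0).
f32x4 dpps_model(f32x4 a, f32x4 b, std::uint8_t imm8) {
    float t[4];
    for (int i = 0; i < 4; ++i) {
        t[i] = ((imm8 >> (4 + i)) & 1) ? a[i] * b[i] : 0.0f;
    }
    // Assumed summation order: pairwise, (t0 + t1) + (t2 + t3). This differs
    // from the strict left-to-right order written in the source snippets.
    float sum = (t[0] + t[1]) + (t[2] + t[3]);
    f32x4 r = {0.0f, 0.0f, 0.0f, 0.0f};
    for (int i = 0; i < 4; ++i) {
        r[i] = ((imm8 >> i) & 1) ? sum : 0.0f;
    }
    return r;
}

// Masks that would correspond to the four snippets in the report.
float simple_dot_product_model(f32x4 a, f32x4 b) {
    return dpps_model(a, b, 0xF1)[0];  // all products, result in lane 0
}
f32x4 dot_product_broadcast_model(f32x4 a, f32x4 b) {
    return dpps_model(a, b, 0xFF);     // all products, all lanes
}
float selective_dot_product_model(f32x4 a, f32x4 b) {
    return dpps_model(a, b, 0xD1)[0];  // products 0, 2, 3, result in lane 0
}
f32x4 selective_dot_product_selective_broadcast_model(f32x4 a, f32x4 b) {
    return dpps_model(a, b, 0xDB);     // products 0, 2, 3, lanes 0, 1, 3
}
```

If the pairwise order assumed above is what the hardware does, the source expressions and `dpps` associate the additions differently, which would be one reason the fold needs a reassociation-permitting flag.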
Note that this might depend on flags that affect floating-point accuracy, such as `-fassociative-math` or `-ffp-contract=*`, since using the dot product instruction might yield higher accuracy. Looking at https://www.felixcloutier.com/x86/dpps, it's a bit unclear whether intermediate rounding is performed or whether this acts as a sort of fused multiply-add.
Also note that pre-multiplying `a` and `b` (forming the element-wise product as a vector before summing) yields better codegen without `-ffast-math` or the like, as seen in the linked collection.
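
For reference, a pre-multiplied variant might look like the sketch below. The linked collection is not reproduced here, so this is only a guess at the shape of such code; `premultiplied_dot_product` is a hypothetical name and the `f32x4` typedef is an assumed definition.

```c++
// Assumption: f32x4 is a GCC/Clang vector-extension type.
typedef float f32x4 __attribute__((vector_size(16)));

// Multiply element-wise as one vector operation first, then sum the lanes
// in the same left-to-right order as the original snippets.
float premultiplied_dot_product(f32x4 a, f32x4 b) {
    f32x4 p = a * b;
    return p[0] + p[1] + p[2] + p[3];
}
```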