Issue 173172
Summary [x86] A negation + multiply-add can't be fused across a shufflevector
Labels new issue
Assignees
Reporter valadaptive
While many ISAs expose "fused multiply-subtract" and "fused negate-multiply-add" operations, LLVM IR offers only an `llvm.fma` intrinsic. The expectation is that user code will manually perform a negation and multiply-add, and LLVM's various backends will fuse those into a single instruction if the ISA supports it. [Rust's own architecture-specific intrinsics rely on this pattern now.](https://github.com/rust-lang/stdarch/blob/61119062fb9be522df5bd81ff0974d6c1e887dbc/crates/core_arch/src/x86/fma.rs#L396)
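For illustration, here is a minimal portable sketch of that "negate, then multiply-add" pattern. It is not the stdarch code linked above. The name `fnmadd4` is hypothetical, and it uses the standard library's `f32::mul_add` rather than any architecture-specific intrinsic.

```rust
/// Hypothetical helper: computes `-(a * b) + c` per lane.
/// The negation is written out explicitly, with the expectation that a
/// backend can fold it into a single fused negate-multiply-add.
pub fn fnmadd4(a: [f32; 4], b: [f32; 4], c: [f32; 4]) -> [f32; 4] {
    let mut out = [0.0f32; 4];
    for i in 0..4 {
        // Negate one multiplicand, then do a fused multiply-add.
        out[i] = (-a[i]).mul_add(b[i], c[i]);
    }
    out
}
```

Whether this particular loop ends up as a single fused instruction depends on the target features and optimization level; the sketch only shows the source-level shape of the pattern.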

In many cases, however, LLVM will rearrange or hoist the `fneg` operation in a way that the backends cannot recognize. For instance, if you perform a `shufflevector` followed by an `fneg`, LLVM's optimizer recognizes that applying the `fneg` before or after the `shufflevector` is equivalent, and hoists it above the shuffle. The backends, however, are not as clever. They will therefore fail to fuse the `fneg` + `shufflevector` + `llvm.fma` sequence into a single fused negate-multiply-add.
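The equivalence being exploited can be sketched in plain Rust. The helpers below are hypothetical stand-ins: `concat` plays the role of a `shufflevector` that joins two 4-lane vectors into one 8-lane vector, and the two functions apply the negation before or after it. This models the semantics only and says nothing about which LLVM pass performs the rewrite.

```rust
/// Hypothetical stand-in for a `shufflevector` that concatenates two
/// 4-lane vectors into one 8-lane vector.
pub fn concat(lo: [f32; 4], hi: [f32; 4]) -> [f32; 8] {
    let mut out = [0.0f32; 8];
    out[..4].copy_from_slice(&lo);
    out[4..].copy_from_slice(&hi);
    out
}

/// Negate each 4-lane input first, then shuffle.
pub fn neg_then_shuffle(lo: [f32; 4], hi: [f32; 4]) -> [f32; 8] {
    concat(lo.map(|x| -x), hi.map(|x| -x))
}

/// Shuffle first, then negate the 8-lane result.
pub fn shuffle_then_neg(lo: [f32; 4], hi: [f32; 4]) -> [f32; 8] {
    concat(lo, hi).map(|x| -x)
}
```

Because negation acts on each lane independently, both orders produce the same lanes, which is why the optimizer is free to move the `fneg` across the shuffle.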

[Here's a Compiler Explorer demo.](https://godbolt.org/z/YazdEhvY8) There are three functions that all perform two fused negate-multiply-add operations on their operands. Their *behavior* is identical, but they generate different code.

The first version of the function is autovectorized, and combines the `[f32; 4]` operands into 256-bit AVX2 vectors before performing a single 256-bit fused multiply-add. You would expect this to be a negate-multiply-add, especially because negation has no native instruction on x86 and requires loading an operand from memory. Instead, the `fneg` is hoisted above the `shufflevector` that combines the operands, leaving the backend unable to emit a single `vfnmadd`.

If this were an issue with the autovectorizer, you'd expect the third version of the function to do better. It explicitly lays out the order of operations, combining all input vectors into wider versions *before* doing any explicit arithmetic. However, LLVM is again too clever by half and hoists the `fneg` above the `shufflevector`. Once again, we end up loading an operand from memory and performing extra operations when we could be getting the negation "for free" in a `vfnmadd` instruction.

Only the second version of the function avoids this, since it explicitly performs two separate 128-bit operations and LLVM does not fuse these.
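As a rough sketch of the two explicit shapes described above (this is not the code from the Compiler Explorer demo, and all names are hypothetical), the difference is whether the operands are widened before or after the arithmetic:

```rust
/// Hypothetical lane-wise negate-multiply-add: `-(a * b) + c`.
pub fn fnmadd<const N: usize>(a: [f32; N], b: [f32; N], c: [f32; N]) -> [f32; N] {
    let mut out = [0.0f32; N];
    for i in 0..N {
        out[i] = (-a[i]).mul_add(b[i], c[i]);
    }
    out
}

/// Hypothetical stand-in for the widening `shufflevector`.
pub fn widen(lo: [f32; 4], hi: [f32; 4]) -> [f32; 8] {
    let mut out = [0.0f32; 8];
    out[..4].copy_from_slice(&lo);
    out[4..].copy_from_slice(&hi);
    out
}

/// Shape of the second version: two independent 4-lane operations.
pub fn two_narrow(
    a0: [f32; 4], a1: [f32; 4],
    b0: [f32; 4], b1: [f32; 4],
    c0: [f32; 4], c1: [f32; 4],
) -> ([f32; 4], [f32; 4]) {
    (fnmadd(a0, b0, c0), fnmadd(a1, b1, c1))
}

/// Shape of the third version: widen every operand, then one 8-lane operation.
pub fn one_wide(
    a0: [f32; 4], a1: [f32; 4],
    b0: [f32; 4], b1: [f32; 4],
    c0: [f32; 4], c1: [f32; 4],
) -> [f32; 8] {
    fnmadd(widen(a0, a1), widen(b0, b1), widen(c0, c1))
}
```

Both shapes compute the same lanes; the issue is only about which instructions the backend selects for each.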