| Field | Value |
|---|---|
| Issue | 120015 |
| Summary | AVX mem broadcasts are cached on the stack |
| Labels | new issue |
| Assignees | |
| Reporter | KyleSiefring |

When the vector registers are exhausted inside a loop, clang spills the result of a broadcast to the stack. This is inefficient: broadcasting from memory is as fast as loading, so the value could simply be re-broadcast from its original location where it is needed.

Consider the following pseudocode:
```
float *restrict arr = ...; // restrict prevents aliasing
loop {
    // exhaust vector registers
    __m256 x = _mm256_set1_ps(arr[0]);
    // use x
}
```
When clang compiles this, arr[0] is broadcast once outside the loop, and x is then stored on the stack:
```
vbroadcastss ymm0, dword ptr [rdx]
vmovups ymmword ptr [rsp - 72], ymm0
loop:
...
load x from stack
use x
jmp loop
```
The expected behavior is:
```
loop:
...
vbroadcastss x, dword ptr [rdx]
use x
jmp loop
```
Obligatory Godbolt Sample: https://godbolt.org/z/v7MYcefxY (Sorry if my method of stressing register allocation results in too much asm/bytecode.)
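
For readers who want something compilable without opening the link, here is a minimal C sketch of the same shape as the pseudocode. It is not the code from the Godbolt sample: the function name `scaled_sums`, the `NUM_ACC` constant, and the choice of 16 accumulators are hypothetical, picked on the assumption that 16 live accumulators plus the broadcast value are enough to exceed the 16 ymm registers. Whether a given clang version actually hoists and spills the broadcast here is not guaranteed.

```c
#include <assert.h>
#include <immintrin.h>
#include <stddef.h>

/* Assumption: 16 live accumulators are enough to fill all ymm registers. */
#define NUM_ACC 16

/* Hypothetical kernel: sums in[] into NUM_ACC * 8 lanes, scaled by arr[0].
 * out must hold NUM_ACC * 8 floats; n must be a multiple of NUM_ACC * 8. */
__attribute__((target("avx")))
void scaled_sums(const float *restrict arr, const float *restrict in,
                 float *restrict out, size_t n)
{
    assert(n % (NUM_ACC * 8) == 0);

    /* Constant-bound loops let the compiler keep these in registers. */
    __m256 acc[NUM_ACC];
    for (int i = 0; i < NUM_ACC; ++i)
        acc[i] = _mm256_setzero_ps();

    for (size_t j = 0; j < n; j += NUM_ACC * 8) {
        /* Loop-invariant broadcast: restrict rules out aliasing with out. */
        __m256 x = _mm256_set1_ps(arr[0]);
        for (int i = 0; i < NUM_ACC; ++i) {
            __m256 v = _mm256_loadu_ps(in + j + 8 * i);
            acc[i] = _mm256_add_ps(acc[i], _mm256_mul_ps(v, x));
        }
    }

    for (int i = 0; i < NUM_ACC; ++i)
        _mm256_storeu_ps(out + 8 * i, acc[i]);
}
```

Compiling this with optimizations enabled and inspecting the asm should show whether the `vbroadcastss` is hoisted and its result reloaded from the stack inside the loop, or re-broadcast from `[arr]` as expected.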
_______________________________________________
llvm-bugs mailing list
llvm-bugs@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-bugs