Issue 120015
Summary AVX mem broadcasts are cached on the stack
Labels new issue
Assignees
Reporter KyleSiefring
After exhausting registers inside of a loop, clang stores the result of a broadcast on the stack and reloads it from there. This is inefficient, since broadcasting from memory is as fast as loading.

Consider the following pseudocode:
```
float *restrict arr = ...; // prevent aliasing
loop {
     exhaust vector registers
     __m256 x = _mm256_set1_ps(arr[0]);
     use x
}
```
When clang compiles this pattern, the broadcast of arr[0] is hoisted out of the loop and x is then spilled to the stack:

```
        vbroadcastss    ymm0, dword ptr [rdx]
        vmovups ymmword ptr [rsp - 72], ymm0
loop:
        ...
        load x from stack
        use x
        jmp loop
```

The expected behavior is:
```
loop:
        ...
        vbroadcastss    x, dword ptr [rdx]
        use x
        jmp loop
```
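
For anyone who wants a compilable starting point, the source pattern from the pseudocode might look roughly like the sketch below. The function name `scale_by_first` and its parameters are made up for illustration and are not taken from the Godbolt sample. This small loop does not by itself create enough register pressure to trigger the spill; it only shows where the broadcast sits relative to the loop. The scalar path is an assumption added so the code still builds without AVX enabled.

```c
#include <stddef.h>
#if defined(__AVX__)
#include <immintrin.h>
#endif

/* Hypothetical helper: dst[i] = src[i] * scale[0] for n floats.
 * The broadcast is written inside the loop, as in the pseudocode. */
void scale_by_first(float *restrict dst, const float *restrict src,
                    const float *restrict scale, size_t n)
{
    size_t i = 0;
#if defined(__AVX__)
    for (; i + 8 <= n; i += 8) {
        /* A real reproducer would keep enough vectors live here
         * to use up all of the ymm registers. */
        __m256 x = _mm256_set1_ps(scale[0]);  /* broadcast from memory */
        __m256 v = _mm256_loadu_ps(src + i);
        _mm256_storeu_ps(dst + i, _mm256_mul_ps(v, x));
    }
#endif
    /* Scalar tail; also the whole loop when AVX is not enabled. */
    for (; i < n; ++i)
        dst[i] = src[i] * scale[0];
}
```

Building this with AVX enabled (for example `-O2 -mavx`) and adding more live vector values inside the loop is one way to look for the spill described above.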

Obligatory Godbolt Sample: https://godbolt.org/z/v7MYcefxY (Sorry if my method of stressing register allocation results in too much asm/bytecode.)
_______________________________________________
llvm-bugs mailing list
llvm-bugs@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-bugs
