| Issue | 142046 |
| --- | --- |
| Summary | 64 bit shifts on 32 bit targets fail to utilize shift range information |
| Labels | new issue |
| Assignees | |
| Reporter | tom-rein |
When we know that the shift amount is at least 32, we can translate a 64-bit shift into a single 32-bit shift and set the lower register to 0.
Clang fails to perform this optimization in most cases. This happens at least on ARM, RISC-V and x86, and probably on other targets as well.
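Concretely, for a left shift whose amount is known to be in [32, 63], the simplification I would expect on a 32-bit target looks roughly like this (a minimal C++ sketch of the intended decomposition, with a made-up function name; this is not what Clang currently emits):
```C++
#include <cstdint>

// Hypothetical illustration: with s known to be in [32, 63], the low result
// word is always 0 and the high result word is a single 32-bit shift.
uint64_t expected_shl_ge32(uint64_t x, int s) {
    uint32_t lo = static_cast<uint32_t>(x);
    uint32_t hi = lo << (s - 32);              // one 32-bit shift
    return static_cast<uint64_t>(hi) << 32;    // low word stays zero
}
```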
Here are some examples ([godbolt](https://godbolt.org/z/9Tdf1xKxd)):
```C++
#include <cstdint>
uint64_t shift_assume_ge32(uint64_t x, int s) {
    __builtin_assume(s >= 32);
    return x << s;
}
uint64_t shift_assume_bit5(uint64_t x, int s) {
    __builtin_assume((s | 32) == s);
    return x << s;
}
uint64_t shift_set_bit5(uint64_t x, int s) {
    s |= 32;
    return x << s;
}
uint64_t shift_set_bit5_assume_set(uint64_t x, int s) {
    __builtin_assume((s | 32) == s);
    s |= 32;
    return x << s;
}
uint64_t shift_32_plus(uint64_t x, int s) {
    __builtin_assume(s >= 0);
    s += 32;
    return x << s;
}
uint64_t shift_lt32(uint64_t x, int s) {
    __builtin_assume(s < 32);
    return x << s;
}
uint64_t shift_mask31(uint64_t x, int s) {
    s &= 31;
    return x << s;
}
uint64_t shift_assume_masked(uint64_t x, int s) {
    __builtin_assume((s & 31) == s);
    return x << s;
}
uint64_t shift_mask31_assume_masked(uint64_t x, int s) {
    __builtin_assume((s & 31) == s);
    s &= 31;
    return x << s;
}
```
Assuming that the shift amount is at least 32 or that bit 5 is set has no effect (`shift_assume_ge32()` and `shift_assume_bit5()`).
Similarly, assuming it is less than 32 or that the bit is unset also has no effect (`shift_lt32()` and `shift_assume_masked()`).
Setting bit 5 or masking with 31 (`shift_set_bit5()` and `shift_mask31()`) results in correctly simplified output, unless we additionally assume the value is already in range (`shift_set_bit5_assume_set()` and `shift_mask31_assume_masked()`): then the bit operation is correctly removed, but Clang forgets the range information and emits a general shift again.
Sometimes we have a conditional shift by 32, which on a 32-bit target is just a conditional move, but Clang replaces it with a general 64-bit shift that sometimes gets only partially simplified. In particular, on RISC-V it fails to realize that the left shifts are no-ops.
```C++
uint64_t conditional_shift32(uint64_t x, unsigned b) {
    __builtin_assume(b <= 1);
    return b ? x << 32 : x;
}
uint64_t shift_shifted_condition(uint64_t x, unsigned b) {
    __builtin_assume(b <= 1);
    return x << (32 * b);
}
uint64_t conditional_shift_bit5(uint64_t x, unsigned b) {
    return (b & 32) ? x << 32 : x;
}
uint64_t conditional_shift_bit5_assume(uint64_t x, unsigned b) {
    __builtin_assume((b & 32) == b);
    return (b & 32) ? x << 32 : x;
}
uint64_t conditional_shift_assume_bit5(uint64_t x, unsigned b) {
    __builtin_assume((b & 32) == b);
    return b ? x << 32 : x;
}
```
For `conditional_shift_bit5()` and `conditional_shift_bit5_assume()` Clang sees that it is just a shift by `b`, but in `conditional_shift_assume_bit5()` it fails, even though that version is only missing the `& 32`, which gets removed anyway.
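For reference, what I would expect the conditional shift by 32 to boil down to on a 32-bit target is just a pair of conditional moves, roughly like this (a sketch with a made-up function name, not compiler output):
```C++
#include <cstdint>

// Sketch: shifting by 32 only moves the low word into the high word, so the
// whole conditional shift is two selects (conditional moves on most targets).
uint64_t expected_conditional_shift32(uint64_t x, unsigned b) {
    uint32_t lo = static_cast<uint32_t>(x);
    uint32_t hi = static_cast<uint32_t>(x >> 32);
    uint32_t new_hi = b ? lo : hi;   // conditional move
    uint32_t new_lo = b ? 0u : lo;   // conditional zeroing
    return (static_cast<uint64_t>(new_hi) << 32) | new_lo;
}
```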
The above also made me realize that the generic 64-bit shift for 32-bit RISC-V is super weird:
```
shift(unsigned long long, int):
addi a4, a2, -32
sll a3, a0, a2
bltz a4, .LBB0_2
mv a1, a3
srai a0, a4, 31
and a0, a0, a3
ret
.LBB0_2:
sll a1, a1, a2
not a2, a2
srli a0, a0, 1
srl a0, a0, a2
or a1, a1, a0
srai a0, a4, 31
and a0, a0, a3
ret
```
The `srai` + `and` before each `ret` is completely useless: on the fall-through path the result is always 0, and on the `.LBB0_2` path it is simply `a3`.
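Written out as C++, my reading of that branchy lowering is roughly the following (a sketch with a name of my choosing, not the actual SelectionDAG expansion); it shows that the final `srai` + `and` recomputes a mask that is a known constant on each path:
```C++
#include <cstdint>

uint64_t current_rv32_shl64(uint32_t lo, uint32_t hi, int s) {
    uint32_t lo_shifted = lo << (s & 31);                // sll a3, a0, a2
    uint32_t new_hi;
    if (s - 32 >= 0) {                                   // fall-through path
        new_hi = lo_shifted;                             // mv a1, a3
    } else {                                             // .LBB0_2
        new_hi = (hi << s) | ((lo >> 1) >> (~s & 31));   // sll/srli/srl/or
    }
    // srai + and: arithmetic shift mirroring srai (implementation-defined in
    // C++, arithmetic on these targets). The mask is all-ones iff s < 32, so
    // new_lo is 0 on the first path and lo_shifted on the second.
    uint32_t mask = static_cast<uint32_t>(static_cast<int32_t>(s - 32) >> 31);
    uint32_t new_lo = mask & lo_shifted;
    return (static_cast<uint64_t>(new_hi) << 32) | new_lo;
}
```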
With the Zicond extension it is slightly better:
```
shift(unsigned long long, int):
sll a1, a1, a2
not a3, a2
srli a4, a0, 1
sll a0, a0, a2
addi a2, a2, -32
srl a3, a4, a3
slti a2, a2, 0
or a1, a1, a3
czero.nez a3, a0, a2
czero.eqz a1, a1, a2
or a1, a1, a3
czero.eqz a0, a0, a2
ret
```
The `addi` is still there, even though it could have been folded into the `slti`. Also, a minor nitpick: if the `srli` were placed after the `srl`, we could avoid using `a4` and use a compressible `srli`.
This is how I would write it:
```
shift(unsigned long long, int):
sll a1, a1, a2
not a3, a2
srl a3, a0, a3
sll a0, a0, a2
srli a3, a3, 1
slti a2, a2, 32
or a1, a1, a3
czero.nez a3, a0, a2
czero.eqz a1, a1, a2
or a1, a1, a3
czero.eqz a0, a0, a2
```
(The `slti` could also be replaced with `srli a2, a2, 5`, which has a compressed version, whereas if I'm not mistaken `slti` does not)
This can be rewritten using only base instructions at the cost of just two more instructions:
```
shift(unsigned long long, int):
sll a1, a1, a2
not a3, a2
srl a3, a0, a3
sll a0, a0, a2
srli a3, a3, 1
addi a2, a2, -32
srai a2, a2, 31
or a1, a1, a3
and a1, a1, a2
not a3, a2
and a3, a3, a0
or a1, a1, a3
and a0, a0, a2
```
(The `addi` could be replaced with `slli a2, a2, 26`, but I'm not sure which is better)
To me this is preferable to a branchy version from a side-channel perspective alone, since it is highly unintuitive that a shift introduces a branch.
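For completeness, the branchless base-instruction scheme above corresponds to roughly this C++ (again a sketch with names of my choosing): both candidate results are computed and then selected with an all-ones/all-zeros mask instead of a branch.
```C++
#include <cstdint>

uint64_t branchless_shl64(uint32_t lo, uint32_t hi, int s) {
    uint32_t lo_shifted = lo << (s & 31);                        // sll a0, a0, a2
    uint32_t carry      = (lo >> (~s & 31)) >> 1;                // srl + srli
    uint32_t hi_small   = (hi << (s & 31)) | carry;              // high word if s < 32
    uint32_t mask = static_cast<uint32_t>(                       // addi + srai:
        static_cast<int32_t>(s - 32) >> 31);                     // all-ones iff s < 32
    uint32_t new_hi = (hi_small & mask) | (lo_shifted & ~mask);  // and/not/and/or select
    uint32_t new_lo = lo_shifted & mask;                         // zero when s >= 32
    return (static_cast<uint64_t>(new_hi) << 32) | new_lo;
}
```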