Issue | 123262
Summary | [clang++][aarch64] help optimize __builtin_mul_overflow performance
Labels | clang
Assignees |
Reporter | eric-yq

Hi team, I have sample code that runs about 10 times slower when compiled with clang++ than with g++.
The main performance issue is in `__builtin_mul_overflow` under clang++.
Can you give some suggestions? I do not want to use both g++ and clang++ in my CI/CD pipeline.
Compile commands and `Time taken` comparison (0.22 seconds vs. 0.02 seconds):
```c
# Ubuntu 24.04, g++ 13.3 and clang++ 18.1.3
# Server:AWS c7g.xlarge(AWS Graviton3, Neoverse-V1)
# g++ -std=c++17 -O3 -march=armv8-a+crc testint.cpp -o testint-g++
# ./testint-g++
Time taken for 10000000 iterations: 0.0208047 seconds
Sum of results: 9747553088193654009
# clang++ -std=c++17 -O3 -march=armv8-a+crc testint.cpp -o testint-clang++ --rtlib=compiler-rt
# ./testint-clang++
Time taken for 10000000 iterations: 0.226598 seconds //// ( 0.22 seconds vs. 0.02 seconds. )
Sum of results: 18269431752893742105
```
Sample code: testint.cpp
```cpp
#include <iostream>
#include <chrono>
#include <random>
#include <cstdint>
#include <limits>
#include <utility>
#include <vector>

// Define a 128-bit integer type (if the compiler supports it)
using int128_t = __int128;

// Function under benchmark
inline bool int128_mul_overflow(int128_t a, int128_t b, volatile int128_t* c) {
    return __builtin_mul_overflow(a, b, c);
}

// Generate a random 128-bit integer
int128_t generate_random_int128() {
    static std::mt19937_64 rng(std::random_device{}());
    std::uniform_int_distribution<uint64_t> dist(0, std::numeric_limits<uint64_t>::max());
    // Generate two 64-bit integers and combine them into one 128-bit integer
    int128_t high = static_cast<int128_t>(dist(rng));
    int128_t low = static_cast<int128_t>(dist(rng));
    return (high << 64) | low;
}

// Generate random data and store it in a vector
std::vector<std::pair<int128_t, int128_t>> generate_random_data(int count) {
    std::vector<std::pair<int128_t, int128_t>> data;
    data.reserve(count);
    for (int i = 0; i < count; ++i) {
        int128_t a = generate_random_int128();
        int128_t b = generate_random_int128();
        data.emplace_back(a, b);
    }
    return data;
}

// Benchmark function
void benchmark_int128_mul_overflow(const std::vector<std::pair<int128_t, int128_t>>& data) {
    int128_t c = 0;
    int128_t sum = 0; // Accumulates the results
    auto start = std::chrono::high_resolution_clock::now();
    for (const auto& pair : data) {
        if (int128_mul_overflow(pair.first, pair.second, &c)) {
            sum += c; // Accumulate the result so the call is not optimized away
        }
    }
    auto end = std::chrono::high_resolution_clock::now();
    std::chrono::duration<double> duration = end - start;
    std::cout << "Time taken for " << data.size() << " iterations: " << duration.count() << " seconds\n";
    std::cout << "Sum of results: " << static_cast<uint64_t>(sum) << "\n"; // Print the accumulated result
}

int main() {
    int iterations = 10000000; // Adjust the iteration count as needed
    auto data = generate_random_data(iterations);
    benchmark_int128_mul_overflow(data);
    return 0;
}
```
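Because the sample seeds its generator from `std::random_device`, the two `Sum of results` values above come from different inputs and cannot be compared directly. To cross-check the overflow flag that each compiler's `__builtin_mul_overflow` returns, a slow reference check along the following lines might help. This is only a sketch: the names `i128`, `u128` and `mul_overflows_reference` are hypothetical and not part of the original sample, and the division-based test is intended for validation, not for speed.

```cpp
// Hypothetical aliases, not part of the original sample.
using i128 = __int128;
using u128 = unsigned __int128;

// Hypothetical reference check: returns true when a * b does not fit in a
// signed 128-bit integer. It never performs the signed multiplication.
bool mul_overflows_reference(i128 a, i128 b) {
    if (a == 0 || b == 0) {
        return false; // A zero product always fits.
    }
    // The result is negative when exactly one operand is negative.
    const bool negative = (a < 0) != (b < 0);
    // Magnitudes in the unsigned domain. Negating via 0 - x is well defined
    // for unsigned types and also handles the most negative value.
    const u128 ua = a < 0 ? u128(0) - static_cast<u128>(a) : static_cast<u128>(a);
    const u128 ub = b < 0 ? u128(0) - static_cast<u128>(b) : static_cast<u128>(b);
    // Largest magnitude that still fits: 2^127 for negative results,
    // 2^127 - 1 for positive results.
    const u128 limit = negative ? (u128(1) << 127) : (u128(1) << 127) - 1;
    // For positive integers, ua * ub > limit is equivalent to ua > limit / ub.
    return ua > limit / ub;
}
```

One possible use is to call both this function and `int128_mul_overflow` on every pair in `data` and compare the returned flags. A mismatch would point to a correctness problem rather than a performance one.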