https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84280
Bug ID: 84280
Summary: Performance regression in g++-7 with Eigen for non-AVX2 CPUs
Product: gcc
Version: 7.2.1
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: c++
Assignee: unassigned at gcc dot gnu.org
Reporter: patrikhuber at gmail dot com
Target Milestone: ---

Hello,

I noticed today what looks like quite a large performance regression with Eigen (3.3.4) matrix multiplication in g++-7. It only seems to occur on non-AVX2 code paths: if I compile with -march=native on my Core i7 with AVX2, the code is blazingly fast with both g++ versions, but not on an older Core i5 with only AVX, or when I compile with -march=core2.

Here are some example timings; the same pattern holds for all matrix sizes that the benchmark tests (see the end of this message for the code):

g++-5 gemm_test.cpp -std=c++17 -I 3rdparty/eigen/ -march=core2 -O3 -o gcc5_gemm_test
1124 1215 1465
elapsed_ms: 1970
--------
1730 1235 1758
elapsed_ms: 3505

g++-7 gemm_test.cpp -std=c++17 -I 3rdparty/eigen/ -march=core2 -O3 -o gcc7_gemm_test
1124 1215 1465
elapsed_ms: 2998
--------
1730 1235 1758
elapsed_ms: 4628

It is even worse on an i5-3550, which has AVX but not AVX2:

g++-5 gemm_test.cpp -std=c++17 -I 3rdparty/eigen/ -march=native -O3 -o gcc5_gemm_test
1124 1215 1465
elapsed_ms: 941
--------
1730 1235 1758
elapsed_ms: 1780

g++-7 gemm_test.cpp -std=c++17 -I 3rdparty/eigen/ -march=native -O3 -o gcc7_gemm_test
1124 1215 1465
elapsed_ms: 1988
--------
1730 1235 1758
elapsed_ms: 3740

I tried the same with -O2 and got the same results. That is a drop to nearly half the speed in matrix multiplication on AVX CPUs. Or maybe I've done something wrong. :-)

I realise the benchmark is a bit crude (Google Benchmark or similar would be better), but the results are consistent across various CPUs, compilers, and flags.

=== Benchmark code:

// gemm_test.cpp
#include <chrono>
#include <iostream>
#include <string>
#include <vector>

#include <Eigen/Dense>

using RowMajorMatrixXf =
    Eigen::Matrix<float, Eigen::Dynamic, Eigen::Dynamic, Eigen::RowMajor>;
using ColMajorMatrixXf =
    Eigen::Matrix<float, Eigen::Dynamic, Eigen::Dynamic, Eigen::ColMajor>;

template <typename Mat>
void run_test(const std::string& name, int s1, int s2, int s3)
{
    using namespace std::chrono;
    float checksum = 0.0f; // prevents the compiler from optimising everything away
    const auto start_time = high_resolution_clock::now();
    for (int i = 0; i < 10; ++i)
    {
        // The matrices are left uninitialised on purpose; only the
        // multiplication time matters here, not the values.
        Mat a(s1, s2);
        Mat b(s2, s3);
        const Mat c = a * b;
        checksum += c(0, 0);
    }
    const auto end_time = high_resolution_clock::now();
    const auto elapsed_ms = duration_cast<milliseconds>(end_time - start_time).count();
    std::cout << name << " (checksum: " << checksum << ") elapsed_ms: " << elapsed_ms
              << std::endl;
}

int main()
{
    // Matrix sizes; originally drawn from std::uniform_int_distribution<>(1, 2048)
    // with a fixed std::mt19937 seed, then hard-coded for reproducibility.
    const std::vector<int> vals = { 1124, 1215, 1465, 1730, 1235, 1758,
                                    1116, 1736, 868,  1278, 1323, 788 };
    for (std::size_t i = 0; i + 2 < vals.size(); i += 3)
    {
        const int s1 = vals[i];
        const int s2 = vals[i + 1];
        const int s3 = vals[i + 2];
        std::cout << s1 << " " << s2 << " " << s3 << std::endl;
        run_test<ColMajorMatrixXf>("col major", s1, s2, s3);
        run_test<RowMajorMatrixXf>("row major", s1, s2, s3);
        std::cout << "--------" << std::endl;
    }
    return 0;
}
===
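
For anyone reproducing this: since the report hinges on the AVX vs. AVX2 distinction, it can help to confirm which vector code path each build actually uses. Eigen/Core provides Eigen::SimdInstructionSetsInUse(), which returns a human-readable summary of the instruction sets Eigen was compiled with. A minimal sketch, separate from the benchmark above:

// simd_check.cpp: print the SIMD instruction sets Eigen compiled in
#include <iostream>
#include <Eigen/Core>

int main()
{
    // Prints e.g. "SSE, SSE2, SSE3, SSSE3" for -march=core2, and a
    // string mentioning AVX (and AVX2 where enabled) for -march=native
    // on the CPUs discussed above.
    std::cout << Eigen::SimdInstructionSetsInUse() << std::endl;
    return 0;
}

Compiling this once with each -march flag confirms that the slow and fast binaries really take different vectorisation paths, ruling out a build-setup mistake.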