Based on some of my recent performance work, I'm growing uncomfortable with
using gcc as the performance baseline, since its results can differ
significantly (sometimes 3-4x or more on certain fast algorithms) from clang
and MSVC. The perf results on https://github.com/apache/arrow/pull/7506 were
really surprising -- some benchmarks that showed a 2-5x performance
improvement on both clang and MSVC showed small regressions (20-30%) with gcc.
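To make that kind of comparison concrete, a minimal standalone microbenchmark
along the lines of the sketch below (written against Google Benchmark, the
same framework our C++ benchmarks already use) can be compiled as-is with each
compiler and compared. This is only an illustration, not one of the kernels
from that PR -- just the sort of conditional-sum loop where gcc and clang
often make different vectorization decisions:

    // bench_sum.cc -- illustrative sketch only, not an Arrow benchmark.
    // Build the same file once per compiler at the same optimization level:
    //   g++     -O3 bench_sum.cc -lbenchmark -lpthread -o bench_gcc
    //   clang++ -O3 bench_sum.cc -lbenchmark -lpthread -o bench_clang
    #include <benchmark/benchmark.h>

    #include <cstdint>
    #include <vector>

    // A simple conditional sum: loops like this are where gcc and clang
    // frequently make different vectorization choices, so throughput can
    // differ substantially between otherwise identical builds.
    static void BM_ConditionalSum(benchmark::State& state) {
      std::vector<int32_t> values(state.range(0));
      for (size_t i = 0; i < values.size(); ++i) {
        values[i] = static_cast<int32_t>(i % 251);
      }
      for (auto _ : state) {
        int64_t sum = 0;
        for (int32_t v : values) {
          if (v % 2 == 0) sum += v;
        }
        benchmark::DoNotOptimize(sum);
      }
      state.SetItemsProcessed(state.iterations() * state.range(0));
    }
    BENCHMARK(BM_ConditionalSum)->Arg(1 << 20);

    BENCHMARK_MAIN();

Running the two binaries and comparing the reported items/sec is usually
enough to tell whether an observed speedup or regression is compiler-specific.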
I don't think we need a hard-and-fast rule about whether to accept PRs based
on benchmarks, but there are a few guiding criteria:

* How much binary size does the new code add? I think many of us would agree
  that a 20% performance increase on some algorithm might not be worth adding
  500KB to libarrow.so
* Is the code generally faster across the major compiler targets (gcc, clang,
  MSVC)?

I think that using clang as a baseline for informational benchmarks would be
good, but ultimately we need to be systematically collecting data on all the
major compilers. Some time ago I proposed building a Continuous Benchmarking
framework (https://github.com/conbench/conbench/blob/master/doc/REQUIREMENTS.md)
for use with Arrow (and outside of Arrow, too), so I hope that this will be
able to help.

- Wes

On Mon, Jun 22, 2020 at 5:12 AM Yibo Cai <yibo....@arm.com> wrote:
>
> On 6/22/20 5:07 PM, Antoine Pitrou wrote:
> >
> > On 22/06/2020 06:27, Micah Kornfield wrote:
> >> There has been significant effort recently trying to optimize our C++
> >> code. One thing that seems to come up frequently is different benchmark
> >> results between GCC and Clang. Even different versions of the same
> >> compiler can yield significantly different results on the same code.
> >>
> >> I would like to propose that we choose a specific compiler and version
> >> on Linux for evaluating performance-related PRs. PRs would only be
> >> accepted if they improve the benchmarks under the selected version.
> >
> > Would this be a hard rule or just a guideline? There are many ways in
> > which benchmark numbers can be improved or deteriorated by a PR, and in
> > some cases that doesn't matter (benchmarks are not always realistic, and
> > they are not representative of every workload).
>
> I agree that microbenchmarks are not always useful; focusing too much on
> improving microbenchmark results gives me a feeling of "overfitting" (to
> some specific microarchitecture, compiler, or use case).
>
> > Regards
> >
> > Antoine.