Hello, I think that creating a macro-benchmarking module would be a very good idea. It would make performance-related changes much easier and safer to make.
I have also used Peel, and can confirm that it would be a good fit for
this task.

> I've also been looking recently at some of the hot code and see about a
> ~12-14% total improvement when modifying NormalizedKeySorter.compare/swap
> to bitshift and bitmask rather than divide and modulo. The trade-off is
> that aligning on a power of 2 leaves holes and requires additional
> MemoryBuffers.

I've also noticed the performance problem caused by those divisions in
NormalizedKeySorter.compare/swap, and have an idea for eliminating them
without the power-of-2 alignment trade-off. I've opened a Jira [1], where
I explain it.

Best,
Gábor

[1] https://issues.apache.org/jira/browse/FLINK-3722

2016-04-06 18:56 GMT+02:00 Greg Hogan <c...@greghogan.com>:

> I'd like to discuss the creation of a macro-benchmarking module for Flink.
> This could be run during pre-release testing to detect performance
> regressions and during development when refactoring or performance-tuning
> code on the hot path.
>
> Many users have published benchmarks, and the Flink libraries already
> contain a modest selection of algorithms. Some benefits of creating a
> consolidated collection of macro-benchmarks include:
>
> - comprehensive code coverage: a diverse set of algorithms can stress
> every aspect of Flink (streaming, batch, sorts, joins, spilling,
> cluster, ...)
>
> - codify best practices: benchmarks should be relatively stable and
> repeatable
>
> - efficiency: an automated system can run many more tests and generate
> more accurate results
>
> Macro-benchmarks would be useful in analyzing the performance gains from
> the proposed specialized serializers and comparators [FLINK-3599] or from
> making Flink NUMA-aware [FLINK-3163].
>
> I've also been looking recently at some of the hot code and see about a
> ~12-14% total improvement when modifying NormalizedKeySorter.compare/swap
> to bitshift and bitmask rather than divide and modulo. The trade-off is
> that aligning on a power of 2 leaves holes and requires additional
> MemoryBuffers. I'm also testing on a single data type, IntValue, and there
> may be different results for LongValue, StringValue, custom types, or
> different algorithms. And replacing a multiply with a left shift actually
> reduces performance, demonstrating the need to test changes in isolation.
>
> There are many more ideas, e.g. NormalizedKeySorter writing keys before
> the pointer so that the offset computation is performed outside of the
> compare and sort methods. Also, SpanningRecordSerializer could skip to the
> next buffer rather than writing the length across buffers. These changes
> might each be worth a few percent. Other changes might be less than a 1%
> speedup, but taken in aggregate they will yield a noticeable performance
> increase.
>
> I like the idea of profile first, measure second, then create and discuss
> the pull request.
>
> As for the actual macro-benchmarking framework, it would be nice if the
> algorithms also verified correctness alongside performance. The algorithm
> interface would be warmup (run only once) and execute, which would be run
> multiple times in an interleaved manner. The benchmarking duration should
> be tunable.
>
> The framework would be responsible for configuring, starting, and stopping
> the cluster, executing algorithms and recording performance, and comparing
> and analyzing results.
>
> Greg
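As a reference for the compare/swap discussion above, here is a rough sketch
of the index arithmetic in question. It is illustrative only and not Flink's
actual NormalizedKeySorter code; the class name, segment size, and record
size are made up for the example. It contrasts locating a fixed-size sort
record with divide/modulo against shift/mask, where the latter requires
rounding the records-per-segment count down to a power of 2 and therefore
wastes space at the end of each segment (the "holes" mentioned in the thread).

    // Illustrative only -- not Flink's actual NormalizedKeySorter code.
    // Locates a fixed-size sort record inside a list of equally sized memory
    // segments: first with divide/modulo, then with shift/mask, which requires
    // a power-of-2 record count per segment.
    public class IndexArithmeticSketch {

        // sizes made up for the example
        static final int SEGMENT_SIZE = 32 * 1024;  // bytes per segment
        static final int RECORD_SIZE  = 20;         // bytes per sort record

        // General case: works for any records-per-segment value, but the
        // division and modulo sit on the hot path of every compare/swap.
        static long locateWithDivMod(int index) {
            int recordsPerSegment = SEGMENT_SIZE / RECORD_SIZE;        // 1638
            int segment = index / recordsPerSegment;                   // divide
            int offset  = (index % recordsPerSegment) * RECORD_SIZE;   // modulo
            return ((long) segment << 32) | offset;
        }

        // Power-of-2 case: records per segment is rounded down to a power of 2,
        // so the division becomes a shift and the modulo becomes a mask. The
        // rounding leaves unused bytes at the end of each segment -- the
        // holes / extra buffers trade-off mentioned above.
        static final int RECORDS_P2 =
                Integer.highestOneBit(SEGMENT_SIZE / RECORD_SIZE);     // 1024
        static final int SHIFT = Integer.numberOfTrailingZeros(RECORDS_P2); // 10
        static final int MASK  = RECORDS_P2 - 1;

        static long locateWithShiftMask(int index) {
            int segment = index >>> SHIFT;                 // replaces the divide
            int offset  = (index & MASK) * RECORD_SIZE;    // replaces the modulo
            return ((long) segment << 32) | offset;
        }

        public static void main(String[] args) {
            // The two layouts place the same logical record at different
            // physical positions; the point is the arithmetic, not the address.
            System.out.println(Long.toHexString(locateWithDivMod(5000)));
            System.out.println(Long.toHexString(locateWithShiftMask(5000)));
        }
    }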
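Similarly, a minimal sketch of the benchmark interface Greg describes, with
warmup run once, execute timed and interleaved across benchmarks, a tunable
number of rounds, and correctness verified alongside performance. All names
below are hypothetical; nothing here exists in Flink.

    import java.util.List;

    // Hypothetical names only -- a sketch of the proposed interface,
    // not existing Flink code.
    interface MacroBenchmark {
        String name();
        void warmup() throws Exception;    // run only once, untimed
        void execute() throws Exception;   // timed; run many times
        boolean verify();                  // correctness alongside performance
    }

    class BenchmarkRunner {
        private final int rounds;          // tunable benchmarking duration

        BenchmarkRunner(int rounds) {
            this.rounds = rounds;
        }

        void run(List<MacroBenchmark> benchmarks) throws Exception {
            for (MacroBenchmark b : benchmarks) {
                b.warmup();
            }
            // Interleave the benchmarks round by round rather than running each
            // to completion, so slow drift (JIT, GC, caching) is spread evenly.
            for (int round = 0; round < rounds; round++) {
                for (MacroBenchmark b : benchmarks) {
                    long start = System.nanoTime();
                    b.execute();
                    long millis = (System.nanoTime() - start) / 1_000_000;
                    System.out.printf("%s round %d: %d ms, correct=%b%n",
                            b.name(), round, millis, b.verify());
                }
            }
        }
    }

This sketch leaves out the responsibilities the proposal assigns to the
framework itself: cluster configuration, start/stop, recording results, and
comparing and analyzing runs.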