I'd like to discuss the creation of a macro-benchmarking module for Flink. This could be run during pre-release testing to detect performance regressions and during development when refactoring or performance tuning code on the hot path.
Many users have published benchmarks, and the Flink libraries already contain a modest selection of algorithms. Some benefits of creating a consolidated collection of macro-benchmarks include:

- comprehensive code coverage: a diverse set of algorithms can stress every aspect of Flink (streaming, batch, sorts, joins, spilling, cluster, ...)
- codify best practices: benchmarks should be relatively stable and repeatable
- efficient: an automated system can run many more tests and generate more accurate results

Macro-benchmarks would be useful for analyzing the performance improvement from the proposed specialized serializers and comparators [FLINK-3599] or from making Flink NUMA-aware [FLINK-3163].

I've also been looking recently at some of the hot code and see about a ~12-14% total improvement when modifying NormalizedKeySorter.compare/swap to use bitshift and bitmask rather than divide and modulo (a rough sketch of the index arithmetic is in the P.S. below). The trade-off is that aligning on a power of 2 leaves holes and requires additional MemoryBuffers. I'm also testing on a single data type, IntValue, and the results may differ for LongValue, StringValue, custom types, or other algorithms. Interestingly, replacing a multiply with a left shift reduces performance, which demonstrates the need to test changes in isolation.

There are many more ideas, e.g. having NormalizedKeySorter write keys before the pointer so that the offset computation is performed outside of the compare and sort methods, or having SpanningRecordSerializer skip to the next buffer rather than writing a record's length across buffer boundaries. These changes might each be worth a few percent. Other changes might be less than a 1% speedup but, taken in aggregate, would yield a noticeable performance increase. I like the idea of profile first, measure second, then create and discuss the pull request.

As for the actual macro-benchmarking framework, it would be nice if the algorithms also verified correctness alongside performance. The algorithm interface would consist of warmup, run only once, and execute, which would be run multiple times in an interleaved manner (a straw-man interface and runner are sketched in the P.S.). The benchmarking duration should be tunable. The framework would be responsible for configuring, starting, and stopping the cluster, executing algorithms and recording performance, and comparing and analyzing results.

Greg
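
P.S. To make a couple of the points above concrete, here are rough sketches. They are illustrative only; the constants, class names, and method names are made up and are not the actual Flink code.

The first sketch shows the divide/modulo versus shift/mask arithmetic behind the NormalizedKeySorter numbers, assuming the index entry size is padded so that the entries per segment become a power of two:

    // Illustrative only: not the actual NormalizedKeySorter code.
    public class ShiftMaskSketch {

        private static final int ENTRY_SIZE = 16;                              // hypothetical index entry size
        private static final int ENTRIES_PER_SEGMENT = 32 * 1024 / ENTRY_SIZE; // 2048, a power of two
        private static final int ENTRIES_BITS = Integer.numberOfTrailingZeros(ENTRIES_PER_SEGMENT);
        private static final int ENTRIES_MASK = ENTRIES_PER_SEGMENT - 1;

        // Current style: a divide and a modulo on every compare/swap.
        static int segmentDiv(int logicalPosition) {
            return logicalPosition / ENTRIES_PER_SEGMENT;
        }

        static int offsetMod(int logicalPosition) {
            return (logicalPosition % ENTRIES_PER_SEGMENT) * ENTRY_SIZE;
        }

        // Power-of-two style: a shift and a mask, valid only because
        // ENTRIES_PER_SEGMENT is a power of two.
        static int segmentShift(int logicalPosition) {
            return logicalPosition >>> ENTRIES_BITS;
        }

        static int offsetMask(int logicalPosition) {
            return (logicalPosition & ENTRIES_MASK) * ENTRY_SIZE;
        }

        public static void main(String[] args) {
            // Sanity check that both forms compute the same segment and offset.
            for (int i = 0; i < 1_000_000; i++) {
                if (segmentDiv(i) != segmentShift(i) || offsetMod(i) != offsetMask(i)) {
                    throw new AssertionError("mismatch at " + i);
                }
            }
        }
    }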
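
The second sketch is a straw-man for the algorithm interface: warmup runs once, execute runs repeatedly and returns a checksum so the framework can verify correctness alongside performance. The use of the batch ExecutionEnvironment is just for illustration; a streaming variant would look similar.

    import org.apache.flink.api.java.ExecutionEnvironment;

    // Straw-man interface; names are placeholders, not a proposal for the final API.
    public interface MacroBenchmark {

        // Run exactly once, outside of measurement, e.g. to generate input data.
        void warmup(ExecutionEnvironment env) throws Exception;

        // Run multiple times, interleaved with the other benchmarks. The returned
        // checksum lets the framework verify correctness alongside performance.
        long execute(ExecutionEnvironment env) throws Exception;
    }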
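
And a minimal runner showing what "interleaved" means in practice; the configuration, cluster start/stop, result storage, and comparison that the framework would own are omitted:

    import java.util.List;
    import org.apache.flink.api.java.ExecutionEnvironment;

    // Minimal sketch of the measurement loop only.
    public class MacroBenchmarkRunner {

        public static void run(List<MacroBenchmark> benchmarks, int rounds) throws Exception {
            ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

            // Warm up each benchmark exactly once.
            for (MacroBenchmark benchmark : benchmarks) {
                benchmark.warmup(env);
            }

            // Interleave the measured runs so that JIT, GC, and OS effects are
            // spread across benchmarks rather than biasing any single one.
            for (int round = 0; round < rounds; round++) {
                for (MacroBenchmark benchmark : benchmarks) {
                    long start = System.nanoTime();
                    long checksum = benchmark.execute(env);
                    long elapsedMillis = (System.nanoTime() - start) / 1_000_000;

                    System.out.printf("%s round=%d time=%dms checksum=%d%n",
                            benchmark.getClass().getSimpleName(), round, elapsedMillis, checksum);
                }
            }
        }
    }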