Hi all,

Part of our work on creating benchmarks for Beam is collecting the total
size (in bytes) of the data that was put into the testing pipeline. We need
that in load tests of core Beam operations (to see how big the load really
was) and in IO tests (to calculate throughput). The "not so good" way we're
doing it right now is to add a DoFn step called "ByteMonitor" to the
pipeline, which gets the size of every element using a utility called
"ObjectSizeCalculator" [1].

Problems with this approach:
1. It's computationally expensive. After introducing this change, tests are
5x slower than before, because the size of each record is now calculated
separately.
2. Naturally, the size of a particular record measured this way is greater
than the size of the generated key + value itself. E.g. if a synthetic
source generates a key + value of 10 bytes in total, the collected total
bytes metric is 8x greater (due to wrapping the value in richer objects,
allocating more memory than needed, etc.).

The main question here is: which size of particular records is more
interesting in benchmarks? The, let's call it, "net" size (key + value
size, and nothing else), or the "gross" size (including all allocated
memory for a particular element in PCollection and all the overhead of
wrapping it in richer objects)? Maybe both sizes are worth measuring?

For the "net" size we probably could (should?) do something similar to what
the Nexmark suites have: predefine the size for each element type and read
it when an element is seen in the pipeline [3].
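
To make the "net" idea concrete, here is a minimal sketch in plain Java
(no Beam dependency, so it is not the actual Nexmark or ByteMonitor code).
The names NetSizeSketch, SyntheticRecord and totalNetBytes are hypothetical,
and the KnownSize interface below is my own stand-in for the idea behind
[3], not a copy of it. The assumption is that the synthetic source knows
the key and value lengths when it creates a record, so the size is computed
once and only read later:

```java
import java.util.Arrays;
import java.util.List;

public class NetSizeSketch {

  /** Hypothetical stand-in: an element that can report its own net size. */
  interface KnownSize {
    long sizeInBytes();
  }

  /** Hypothetical synthetic record; net size = key bytes + value bytes. */
  static final class SyntheticRecord implements KnownSize {
    private final byte[] key;
    private final byte[] value;
    private final long sizeInBytes;

    SyntheticRecord(byte[] key, byte[] value) {
      this.key = key;
      this.value = value;
      // Computed once at creation, so reading it later is a field access.
      this.sizeInBytes = (long) key.length + value.length;
    }

    @Override
    public long sizeInBytes() {
      return sizeInBytes;
    }
  }

  /** Sums the predefined sizes; no per-element object graph traversal. */
  static long totalNetBytes(List<? extends KnownSize> elements) {
    long total = 0;
    for (KnownSize element : elements) {
      total += element.sizeInBytes();
    }
    return total;
  }

  public static void main(String[] args) {
    // Three records, each with a 4-byte key and a 6-byte value (10 bytes net).
    List<SyntheticRecord> records = Arrays.asList(
        new SyntheticRecord(new byte[4], new byte[6]),
        new SyntheticRecord(new byte[4], new byte[6]),
        new SyntheticRecord(new byte[4], new byte[6]));
    System.out.println(totalNetBytes(records)); // prints 30
  }
}
```

In a real pipeline the summing would presumably happen in a metric updated
by a monitoring step rather than in a loop, but the per-element cost would
still be a single method call instead of a size calculation.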

What do you think? Is there any other (efficient + reliable) way of
measuring the total load size that I missed?

Thanks for your opinions!

Best,
Łukasz

[1]
https://github.com/apache/beam/blob/a16a5b71cf8d399070a72b0f062693180d56b5ed/sdks/java/testing/test-utils/src/main/java/org/apache/beam/sdk/testutils/metrics/ByteMonitor.java

[2] https://issues.apache.org/jira/browse/BEAM-7431
[3]
https://github.com/apache/beam/blob/eb3b57554d9dc4057ad79bdd56c4239bd4204656/sdks/java/testing/nexmark/src/main/java/org/apache/beam/sdk/nexmark/model/KnownSize.java
