Hi all,

As part of our work on benchmarks for Beam, we collect the total size (in bytes) of the data that is put into the testing pipeline. We need this in load tests of core Beam operations (to see how big the load really was) and in IO tests (to calculate throughput). The "not so good" way we do it right now is to add a DoFn step called "ByteMonitor" to the pipeline, which gets the size of every element using a utility called "ObjectSizeCalculator" [1].
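To make the current approach concrete, here is a rough plain-Java sketch (no Beam dependency) of what such a pass-through monitoring step does. It is only an illustration: `ByteCountingStep`, `KeyValue` and `ByteCountingDemo` are hypothetical names, the size function is pluggable, and the real ByteMonitor is a DoFn that measures elements with ObjectSizeCalculator rather than the simple key + value count assumed here.

```java
import java.util.function.ToLongFunction;

/** Hypothetical synthetic record: a raw key and a raw value. */
final class KeyValue {
  final byte[] key;
  final byte[] value;

  KeyValue(byte[] key, byte[] value) {
    this.key = key;
    this.value = value;
  }
}

/** Hypothetical stand-in for a pass-through monitoring step (not the real ByteMonitor). */
final class ByteCountingStep<T> {
  private final ToLongFunction<T> sizeOf; // how a single element is measured
  private long totalBytes = 0;

  ByteCountingStep(ToLongFunction<T> sizeOf) {
    this.sizeOf = sizeOf;
  }

  /** Measures the element, adds its size to the running total, returns it unchanged. */
  T process(T element) {
    totalBytes += sizeOf.applyAsLong(element);
    return element;
  }

  long totalBytes() {
    return totalBytes;
  }
}

final class ByteCountingDemo {
  /** Sends three synthetic records through the step and returns the measured total. */
  static long run() {
    // Assumption for this sketch: count only the key and value bytes of each record.
    ByteCountingStep<KeyValue> step =
        new ByteCountingStep<>(kv -> kv.key.length + kv.value.length);
    for (int i = 0; i < 3; i++) {
      step.process(new KeyValue(new byte[4], new byte[6])); // 10 bytes per record
    }
    return step.totalBytes();
  }

  public static void main(String[] args) {
    System.out.println("Total bytes: " + run());
  }
}
```

The size function is called once per element, so whatever it costs is paid for every record that flows through the step.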
Problems with this approach:

1. It's computationally expensive. After introducing this change, tests are 5x slower than before, because the size of each record is now calculated separately.

2. Naturally, the size of a particular record measured this way is greater than the size of the generated key + value itself. E.g. if a synthetic source generates a key + value that has 10 bytes in total, the collected total bytes metric is 8x greater (due to wrapping the value in richer objects, allocating more memory than needed, etc.).

The main question here is: which size of a particular record is more interesting in benchmarks? The, let's call it, "net" size (key + value size, and nothing else), or the "gross" size (including all memory allocated for a particular element in a PCollection and all the overhead of wrapping it in richer objects)? Maybe both sizes are worth measuring?

For the "net" size we probably could (should?) do something similar to what the Nexmark suites have: pre-define the size for each element type and read it once the element is spotted in the pipeline [3].

What do you think? Is there any other (efficient + reliable) way of measuring the total load size that I missed?

Thanks for your opinions!

Best,
Łukasz

[1] https://github.com/apache/beam/blob/a16a5b71cf8d399070a72b0f062693180d56b5ed/sdks/java/testing/test-utils/src/main/java/org/apache/beam/sdk/testutils/metrics/ByteMonitor.java
[2] https://issues.apache.org/jira/browse/BEAM-7431
[3] https://github.com/apache/beam/blob/eb3b57554d9dc4057ad79bdd56c4239bd4204656/sdks/java/testing/nexmark/src/main/java/org/apache/beam/sdk/nexmark/model/KnownSize.java