Hey all,
Our distribution metric is fairly coarse in how it describes a distribution
with just min, max, sum, mean, and count. There is currently very little
information on the actual shape of the distribution. I'd like to propose a
couple of improvements.
As a small improvement, I think it would be nice to include stdev. As a
large improvement, we could use tdigests to create a more granular
distribution. This shouldn't be too hard to support since we already have a
TDigest[1] java transform which we can probably adapt for the java sdk
harness; and we can use the fastdigest[2] python library for extending the
python SDK harness
I propose that we extend the current encoding [3] of the distribution
metrics. The current encoding is:
// Encoding: <count><sum><min><max>
// - count: beam:coder:varint:v1
// - sum: beam:coder:double:v1
// - min: beam:coder:double:v1
// - max: beam:coder:double:v1
I suggest just appending to it an encoding of a new `TDigestData` proto
that'd include information on the tdigest centroids and options (e.g. max
centroids). This can be an optional field so we wouldn't need to update all
SDKs at once. Similarly, runners will have the option to ignore the tdigest
option (which by default they will currently). The benefit of this is that
we don't need to expand the user metrics API - users will just get better
distribution information as their sdks/runners implement tdigest support.
Looking for any thoughts or feedback on this idea.
Cheers,
Joey
[1]
https://beam.apache.org/releases/javadoc/2.5.0/index.html?org/apache/beam/sdk/extensions/sketching/TDigestQuantiles.html
[2]
https://beam.apache.org/releases/javadoc/2.5.0/index.html?org/apache/beam/sdk/extensions/sketching/TDigestQuantiles.html
[3]
https://github.com/apache/beam/blob/75866588752de0c47136fde173944bd57c323401/model/pipeline/src/main/proto/org/apache/beam/model/pipeline/v1/metrics.proto#L534