Thanks - I had a few thoughts:

1) Exposing this data somehow seems helpful. I will note that we could get an (estimated) standard deviation from the histogram metric today, without any modifications to the data the metric already stores.

2) The distribution metric is currently quite small, so I'm a bit concerned that this would blow up its size, and I wonder if it would make more sense as an independent metric. Presumably we'd need something like `MergingDigest` [1], and if I read the docs right, that could get quite large over time [2] - but I'm only somewhat proficient in stats :)
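To make 1) concrete, here is a rough sketch of the kind of estimate I have in mind, assuming we know the histogram's bucket boundaries (bucket midpoints stand in for the recorded values, so the error depends on bucket width; the function and its arguments are illustrative, not an existing API):

import math

def estimate_stddev(bucket_counts, bucket_boundaries):
    """Estimate the standard deviation from histogram bucket counts.

    bucket_boundaries has len(bucket_counts) + 1 entries; each bucket's
    midpoint stands in for every value recorded in that bucket, so the
    result is only an approximation.
    """
    midpoints = [
        (lo + hi) / 2.0
        for lo, hi in zip(bucket_boundaries, bucket_boundaries[1:])
    ]
    count = sum(bucket_counts)
    if count == 0:
        return 0.0
    mean = sum(c * m for c, m in zip(bucket_counts, midpoints)) / count
    variance = sum(
        c * (m - mean) ** 2 for c, m in zip(bucket_counts, midpoints)
    ) / count
    return math.sqrt(variance)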
What do you think about exposing this as a new metric type (vs. extending distribution)?

Thanks,
Danny

[1] https://github.com/apache/beam/blob/45c36901859bd5448f7b9d34ca47083feb14d713/sdks/java/extensions/sketching/src/main/java/org/apache/beam/sdk/extensions/sketching/TDigestQuantiles.java#L313
[2] https://artifacts.elastic.co/javadoc/org/elasticsearch/elasticsearch-tdigest/8.11.3/org.elasticsearch.tdigest/org/elasticsearch/tdigest/MergingDigest.html

On Fri, Oct 3, 2025 at 9:11 AM Joey Tran <[email protected]> wrote:

> Just want to bump this in case anyone has any feedback.
>
> On Wed, Oct 1, 2025 at 9:56 AM Joey Tran <[email protected]> wrote:
>
>> Hey all,
>>
>> Our distribution metric is fairly coarse: it describes a distribution
>> with just min, max, sum, mean, and count, so there is currently very
>> little information on the actual shape of the distribution. I'd like to
>> propose a couple of improvements.
>>
>> As a small improvement, I think it would be nice to include stdev. As a
>> larger improvement, we could use tdigests to create a more granular
>> distribution. This shouldn't be too hard to support, since we already
>> have a TDigestQuantiles [1] Java transform that we can probably adapt
>> for the Java SDK harness, and we can use the fastdigest [2] Python
>> library to extend the Python SDK harness.
>>
>> I propose that we extend the current encoding [3] of the distribution
>> metric. The current encoding is:
>>
>> // Encoding: <count><sum><min><max>
>> // - count: beam:coder:varint:v1
>> // - sum: beam:coder:double:v1
>> // - min: beam:coder:double:v1
>> // - max: beam:coder:double:v1
>>
>> I suggest appending to it an encoding of a new `TDigestData` proto that
>> would include information on the tdigest centroids and options (e.g.
>> max centroids). This can be an optional field, so we wouldn't need to
>> update all SDKs at once. Similarly, runners will have the option to
>> ignore the tdigest data (which, by default, they currently will). The
>> benefit of this is that we don't need to expand the user metrics API -
>> users will just get better distribution information as their
>> SDKs/runners implement tdigest support.
>>
>> Looking for any thoughts or feedback on this idea.
>>
>> Cheers,
>> Joey
>>
>> [1] https://beam.apache.org/releases/javadoc/2.5.0/index.html?org/apache/beam/sdk/extensions/sketching/TDigestQuantiles.html
>> [2] https://beam.apache.org/releases/javadoc/2.5.0/index.html?org/apache/beam/sdk/extensions/sketching/TDigestQuantiles.html
>> [3] https://github.com/apache/beam/blob/75866588752de0c47136fde173944bd57c323401/model/pipeline/src/main/proto/org/apache/beam/model/pipeline/v1/metrics.proto#L534
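To make the appending idea in the proposal above concrete, here is a rough sketch of what the extended payload could look like. The simplified varint and the trailing (compression, centroids) section are stand-ins for the real beam:coder:varint:v1 / beam:coder:double:v1 coders and the proposed `TDigestData` proto - purely illustrative, not an existing encoding:

import struct

def encode_varint(n, out):
    # Unsigned LEB128-style varint, in the spirit of beam:coder:varint:v1
    # (simplified here; not the exact Beam coder implementation).
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return

def encode_distribution(count, total, minimum, maximum,
                        centroids=None, compression=100):
    """Encode <count><sum><min><max>, optionally followed by tdigest data.

    `centroids` is a list of (mean, weight) pairs and `compression` is the
    max-centroids setting; together they stand in for the proposed
    TDigestData proto.
    """
    out = bytearray()
    encode_varint(count, out)            # count: varint
    out += struct.pack('>d', total)      # sum:   big-endian double
    out += struct.pack('>d', minimum)    # min:   big-endian double
    out += struct.pack('>d', maximum)    # max:   big-endian double
    if centroids is not None:
        # Hypothetical optional suffix: a decoder that stops reading
        # after <max> simply never sees it.
        encode_varint(compression, out)
        encode_varint(len(centroids), out)
        for mean, weight in centroids:
            out += struct.pack('>d', mean)
            out += struct.pack('>d', weight)
    return bytes(out)

An SDK without tdigest support would just emit the first four fields, and a runner that ignores the suffix stops reading after <max> - whether existing decoders actually tolerate trailing bytes is one of the things we'd need to verify.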
