On Fri, Oct 3, 2025 at 10:47 AM Danny McCormick via dev <[email protected]> wrote:
> Thanks - I had a few thoughts: > > 1) Exposing this data somehow seems helpful. I will note that we could get > the (estimated) standard deviation from the histogram metric today without > any modifications to the data stored by the metric. > 2) I'm a bit concerned that this would blow up the size of the distribution > metric, and I wonder if this would make sense as an independent metric. > Presumably we'd need something like `MergingDigest` [1]. If I read the docs > right, this could get quite large over time [2], but I'm only somewhat > proficient in stats :) The distribution metric is quite small. > > Tdigests are made up of just a series of centroids and weights (so two floats), and grow to a fixed size number of centroids. Somewhere between 100-1000 generally. That only ends up being a few dozen KB at most, so it should be very small. In [2], it mentions that the centroids are 40 bytes so 1000 centroids would only be about 40kb in disk/memory. This seems small to me but size is relative and it's much bigger than the current distribution metric of course. I'm not sure what impacts a 40kb metric might have. (I admit that I am also no expert on tdigests myself and have just read docs/explanations and tried some benchmarks :) [2] https://artifacts.elastic.co/javadoc/org/elasticsearch/elasticsearch-tdigest/8.11.3/org.elasticsearch.tdigest/org/elasticsearch/tdigest/MergingDigest.html > What do you think about exposing this as a new metric type (vs extending > distribution)? > This would be the easiest and probably safest route, I'm just concerned about the growing number of metrics we have, especially metrics that are very similar. I think it's currently confusing as a user whether they should choose int64 distribution, double distribution, int histogram, or now tdigestdistribution. They're all really just describing distributions and if the information included in a tdigest could come for "free" (as in, with an unobservably small cost in performance/memory/disk), then it'd make sense to include it whenever a user adds a distribution metric > > Thanks, > Danny > > [1] > https://github.com/apache/beam/blob/45c36901859bd5448f7b9d34ca47083feb14d713/sdks/java/extensions/sketching/src/main/java/org/apache/beam/sdk/extensions/sketching/TDigestQuantiles.java#L313 > [2] > https://artifacts.elastic.co/javadoc/org/elasticsearch/elasticsearch-tdigest/8.11.3/org.elasticsearch.tdigest/org/elasticsearch/tdigest/MergingDigest.html > > On Fri, Oct 3, 2025 at 9:11 AM Joey Tran <[email protected]> wrote: > >> Just want to bump this in case anyone has any feedback. >> >> On Wed, Oct 1, 2025 at 9:56 AM Joey Tran <[email protected]> wrote: >> >>> Hey all, >>> >>> Our distribution metric is fairly coarse in how it describes a >>> distribution with just min, max, sum, mean, and count. There is currently >>> very little information on the actual shape of the distribution. I'd like >>> to propose a couple of improvements. >>> >>> As a small improvement, I think it would be nice to include stdev. As a >>> large improvement, we could use tdigests to create a more granular >>> distribution. This shouldn't be too hard to support since we already have a >>> TDigest[1] java transform which we can probably adapt for the java sdk >>> harness; and we can use the fastdigest[2] python library for extending the >>> python SDK harness >>> >>> I propose that we extend the current encoding [3] of the distribution >>> metrics. The current encoding is: >>> // Encoding: <count><sum><min><max> >>> // - count: beam:coder:varint:v1 >>> // - sum: beam:coder:double:v1 >>> // - min: beam:coder:double:v1 >>> // - max: beam:coder:double:v1 >>> >>> I suggest just appending to it an encoding of a new `TDigestData` proto >>> that'd include information on the tdigest centroids and options (e.g. max >>> centroids). This can be an optional field so we wouldn't need to update all >>> SDKs at once. Similarly, runners will have the option to ignore the tdigest >>> option (which by default they will currently). The benefit of this is that >>> we don't need to expand the user metrics API - users will just get better >>> distribution information as their sdks/runners implement tdigest support. >>> >>> Looking for any thoughts or feedback on this idea. >>> >>> Cheers, >>> Joey >>> >>> >>> [1] >>> https://beam.apache.org/releases/javadoc/2.5.0/index.html?org/apache/beam/sdk/extensions/sketching/TDigestQuantiles.html >>> [2] >>> https://beam.apache.org/releases/javadoc/2.5.0/index.html?org/apache/beam/sdk/extensions/sketching/TDigestQuantiles.html >>> [3] >>> https://github.com/apache/beam/blob/75866588752de0c47136fde173944bd57c323401/model/pipeline/src/main/proto/org/apache/beam/model/pipeline/v1/metrics.proto#L534 >>> >>
