On Fri, Oct 3, 2025 at 10:47 AM Danny McCormick via dev <[email protected]>
wrote:

> Thanks - I had a few thoughts:
>
> 1) Exposing this data somehow seems helpful. I will note that we could get
> the (estimated) standard deviation from the histogram metric today without
> any modifications to the data stored by the metric.
>
2) I'm a bit concerned that this would blow up the size of the distribution
> metric, and I wonder if this would make sense as an independent metric.
> Presumably we'd need something like `MergingDigest` [1]. If I read the docs
> right, this could get quite large over time [2], but I'm only somewhat
> proficient in stats :) The distribution metric is quite small.
>
> Tdigests are made up of just a series of centroids and weights (so two
floats), and grow to a fixed size number of centroids. Somewhere between
100-1000 generally. That only ends up being a few dozen KB at most, so it
should be very small. In [2], it mentions that the centroids are 40 bytes
so 1000 centroids would only be about 40kb in disk/memory. This seems small
to me but size is relative and it's much bigger than the current
distribution metric of course. I'm not sure what impacts a 40kb metric
might have.

(I admit that I am also no expert on tdigests myself and have just read
docs/explanations and tried some benchmarks :)

[2]
https://artifacts.elastic.co/javadoc/org/elasticsearch/elasticsearch-tdigest/8.11.3/org.elasticsearch.tdigest/org/elasticsearch/tdigest/MergingDigest.html



> What do you think about exposing this as a new metric type (vs extending
> distribution)?
>

This would be the easiest and probably safest route, I'm just concerned
about the growing number of metrics we have, especially metrics that are
very similar. I think it's currently confusing as a user whether they
should choose int64 distribution, double distribution, int histogram, or
now tdigestdistribution. They're all really just describing distributions
and if the information included in a tdigest could come for "free" (as in,
with an unobservably small cost in performance/memory/disk), then it'd make
sense to include it whenever a user adds a distribution metric



>
> Thanks,
> Danny
>
> [1]
> https://github.com/apache/beam/blob/45c36901859bd5448f7b9d34ca47083feb14d713/sdks/java/extensions/sketching/src/main/java/org/apache/beam/sdk/extensions/sketching/TDigestQuantiles.java#L313
> [2]
> https://artifacts.elastic.co/javadoc/org/elasticsearch/elasticsearch-tdigest/8.11.3/org.elasticsearch.tdigest/org/elasticsearch/tdigest/MergingDigest.html
>
> On Fri, Oct 3, 2025 at 9:11 AM Joey Tran <[email protected]> wrote:
>
>> Just want to bump this in case anyone has any feedback.
>>
>> On Wed, Oct 1, 2025 at 9:56 AM Joey Tran <[email protected]> wrote:
>>
>>> Hey all,
>>>
>>> Our distribution metric is fairly coarse in how it describes a
>>> distribution with just min, max, sum, mean, and count. There is currently
>>> very little information on the actual shape of the distribution. I'd like
>>> to propose a couple of improvements.
>>>
>>> As a small improvement, I think it would be nice to include stdev. As a
>>> large improvement, we could use tdigests to create a more granular
>>> distribution. This shouldn't be too hard to support since we already have a
>>> TDigest[1] java transform  which we can probably adapt for the java sdk
>>> harness; and we can use the fastdigest[2] python library for extending the
>>> python SDK harness
>>>
>>> I propose that we extend the current encoding [3] of the distribution
>>> metrics. The current encoding is:
>>>     // Encoding: <count><sum><min><max>
>>>     //   - count: beam:coder:varint:v1
>>>     //   - sum:   beam:coder:double:v1
>>>     //   - min:   beam:coder:double:v1
>>>     //   - max:   beam:coder:double:v1
>>>
>>> I suggest just appending to it an encoding of a new `TDigestData` proto
>>> that'd include information on the tdigest centroids and options (e.g. max
>>> centroids). This can be an optional field so we wouldn't need to update all
>>> SDKs at once. Similarly, runners will have the option to ignore the tdigest
>>> option (which by default they will currently). The benefit of this is that
>>> we don't need to expand the user metrics API - users will just get better
>>> distribution information as their sdks/runners implement tdigest support.
>>>
>>> Looking for any thoughts or feedback on this idea.
>>>
>>> Cheers,
>>> Joey
>>>
>>>
>>> [1]
>>> https://beam.apache.org/releases/javadoc/2.5.0/index.html?org/apache/beam/sdk/extensions/sketching/TDigestQuantiles.html
>>> [2]
>>> https://beam.apache.org/releases/javadoc/2.5.0/index.html?org/apache/beam/sdk/extensions/sketching/TDigestQuantiles.html
>>> [3]
>>> https://github.com/apache/beam/blob/75866588752de0c47136fde173944bd57c323401/model/pipeline/src/main/proto/org/apache/beam/model/pipeline/v1/metrics.proto#L534
>>>
>>

Reply via email to