Thanks - I had a few thoughts:

1) Exposing this data in some form seems helpful. I will note that we could
get the (estimated) standard deviation from the histogram metric today
without any modifications to the data stored by the metric (a rough sketch
of that estimate is below this list).
2) I'm a bit concerned that this would blow up the size of the distribution
metric, which is quite small today, and I wonder if this would make more
sense as an independent metric. Presumably we'd need something like
`MergingDigest` [1]. If I read the docs right, this could get quite large
over time [2], but I'm only somewhat proficient in stats :) (the second
sketch below is one way to gauge the serialized size).
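
For (1), here is a rough sketch of the estimate I have in mind, assuming we
can read per-bucket counts and boundaries out of the histogram metric (the
method signature is hypothetical, and overflow/underflow buckets are ignored):

    // Rough mean/stddev estimate from bucket counts, using bucket midpoints.
    // The boundaries/counts shapes here are illustrative; the real histogram
    // metric's bucket scheme determines how midpoints would be derived.
    static double estimatedStddev(double[] boundaries, long[] counts) {
      // boundaries has length counts.length + 1;
      // bucket i covers [boundaries[i], boundaries[i + 1])
      long total = 0;
      double sum = 0;
      for (int i = 0; i < counts.length; i++) {
        double mid = (boundaries[i] + boundaries[i + 1]) / 2;
        total += counts[i];
        sum += counts[i] * mid;
      }
      double mean = sum / total;
      double sumSqDev = 0;
      for (int i = 0; i < counts.length; i++) {
        double mid = (boundaries[i] + boundaries[i + 1]) / 2;
        sumSqDev += counts[i] * (mid - mean) * (mid - mean);
      }
      return Math.sqrt(sumSqDev / total);
    }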
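
For (2), one way to gauge the size question, using the upstream t-digest
library (com.tdunning.math.stats) that I believe the sketching extension
wraps - I'd expect the Elasticsearch fork in [2] to behave similarly, but
that is exactly what I'd want to confirm:

    import com.tdunning.math.stats.MergingDigest;
    import java.util.Random;

    public class DigestSizeCheck {
      public static void main(String[] args) {
        // Compression bounds the number of centroids (accuracy vs. size).
        MergingDigest digest = new MergingDigest(100);
        Random random = new Random(42);
        for (int i = 0; i < 1_000_000; i++) {
          digest.add(random.nextGaussian());
        }
        // If the serialized size stays bounded by the compression parameter
        // rather than by the number of values added, the "grows over time"
        // worry mostly goes away.
        System.out.println("centroids:        " + digest.centroids().size());
        System.out.println("serialized bytes: " + digest.smallByteSize());
        System.out.println("p50:              " + digest.quantile(0.5));
      }
    }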

What do you think about exposing this as a new metric type (vs extending
distribution)?

Thanks,
Danny

[1]
https://github.com/apache/beam/blob/45c36901859bd5448f7b9d34ca47083feb14d713/sdks/java/extensions/sketching/src/main/java/org/apache/beam/sdk/extensions/sketching/TDigestQuantiles.java#L313
[2]
https://artifacts.elastic.co/javadoc/org/elasticsearch/elasticsearch-tdigest/8.11.3/org.elasticsearch.tdigest/org/elasticsearch/tdigest/MergingDigest.html

On Fri, Oct 3, 2025 at 9:11 AM Joey Tran <[email protected]> wrote:

> Just want to bump this in case anyone has any feedback.
>
> On Wed, Oct 1, 2025 at 9:56 AM Joey Tran <[email protected]> wrote:
>
>> Hey all,
>>
>> Our distribution metric is fairly coarse in how it describes a
>> distribution with just min, max, sum, mean, and count. There is currently
>> very little information on the actual shape of the distribution. I'd like
>> to propose a couple of improvements.
>>
>> As a small improvement, I think it would be nice to include stdev (a
>> rough sketch of a mergeable accumulator is below). As a larger
>> improvement, we could use t-digests to create a more granular
>> distribution. This shouldn't be too hard to support: we already have a
>> TDigest [1] Java transform which we can probably adapt for the Java SDK
>> harness, and we can use the fastdigest [2] Python library to extend the
>> Python SDK harness.
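>>
>> For the stdev part, here is a minimal sketch of the kind of accumulator I
>> have in mind (the class name is hypothetical, and the real change would
>> live in the distribution metric cells of each SDK harness):
>>
>>     // Hypothetical accumulator: tracking sum of squares alongside the
>>     // existing count/sum/min/max makes stdev derivable and keeps merges
>>     // cheap (min/max omitted here for brevity).
>>     class DistributionWithStdev {
>>       long count;
>>       double sum;
>>       double sumOfSquares;
>>
>>       void update(double value) {
>>         count++;
>>         sum += value;
>>         sumOfSquares += value * value;
>>       }
>>
>>       void merge(DistributionWithStdev other) {
>>         count += other.count;
>>         sum += other.sum;
>>         sumOfSquares += other.sumOfSquares;
>>       }
>>
>>       double stdev() {
>>         double mean = sum / count;
>>         return Math.sqrt(Math.max(0, sumOfSquares / count - mean * mean));
>>       }
>>     }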
>>
>> I propose that we extend the current encoding [3] of the distribution
>> metrics. The current encoding is:
>>     // Encoding: <count><sum><min><max>
>>     //   - count: beam:coder:varint:v1
>>     //   - sum:   beam:coder:double:v1
>>     //   - min:   beam:coder:double:v1
>>     //   - max:   beam:coder:double:v1
>>
>> I suggest appending to it an encoding of a new `TDigestData` proto that
>> would include the t-digest centroids and options (e.g. max centroids); a
>> rough sketch of the extended payload is below. This can be an optional
>> field, so we wouldn't need to update all SDKs at once. Similarly, runners
>> would have the option to ignore the t-digest data (which, by default,
>> they currently would). The benefit of this is that we don't need to
>> expand the user metrics API - users will just get better distribution
>> information as their SDKs/runners implement t-digest support.
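>>
>> To make that concrete, here is a rough sketch of the shape of the
>> extended payload (the field names and `TDigestData` handling are
>> hypothetical, and a real implementation would use the existing beam
>> coders rather than DataOutputStream):
>>
>>     import java.io.ByteArrayOutputStream;
>>     import java.io.DataOutputStream;
>>     import java.io.IOException;
>>
>>     // Hypothetical: the existing distribution fields followed by an
>>     // optional, length-prefixed TDigestData blob. Old consumers stop
>>     // reading after max; new ones read the trailing digest if present.
>>     static byte[] encode(long count, double sum, double min, double max,
>>         byte[] serializedTDigestData) throws IOException {
>>       ByteArrayOutputStream bytes = new ByteArrayOutputStream();
>>       DataOutputStream out = new DataOutputStream(bytes);
>>       out.writeLong(count);      // stand-in for beam:coder:varint:v1
>>       out.writeDouble(sum);      // stand-in for beam:coder:double:v1
>>       out.writeDouble(min);
>>       out.writeDouble(max);
>>       if (serializedTDigestData != null) {
>>         out.writeInt(serializedTDigestData.length);  // optional trailer
>>         out.write(serializedTDigestData);
>>       }
>>       out.flush();
>>       return bytes.toByteArray();
>>     }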
>>
>> Looking for any thoughts or feedback on this idea.
>>
>> Cheers,
>> Joey
>>
>>
>> [1]
>> https://beam.apache.org/releases/javadoc/2.5.0/index.html?org/apache/beam/sdk/extensions/sketching/TDigestQuantiles.html
>> [2]
>> https://beam.apache.org/releases/javadoc/2.5.0/index.html?org/apache/beam/sdk/extensions/sketching/TDigestQuantiles.html
>> [3]
>> https://github.com/apache/beam/blob/75866588752de0c47136fde173944bd57c323401/model/pipeline/src/main/proto/org/apache/beam/model/pipeline/v1/metrics.proto#L534
>>
>
