>  grow to a fixed size number of centroids. Somewhere between 100-1000
generally.

Thanks, I was missing the upper bound here. I think this is reasonable
then, so no concerns from me. I'm generally aligned with the goal of not
cluttering the metrics space if we don't need to.

On Fri, Oct 3, 2025 at 2:47 PM Joey Tran <[email protected]> wrote:

>
>
> On Fri, Oct 3, 2025 at 10:47 AM Danny McCormick via dev <
> [email protected]> wrote:
>
>> Thanks - I had a few thoughts:
>>
>> 1) Exposing this data somehow seems helpful. I will note that we could
>> get the (estimated) standard deviation from the histogram metric today
>> without any modifications to the data stored by the metric.
>>
> 2) I'm a bit concerned that this would blow up the size of the
>> distribution metric, and I wonder if this would make sense as an
>> independent metric. Presumably we'd need something like `MergingDigest`
>> [1]. If I read the docs right, this could get quite large over time [2],
>> but I'm only somewhat proficient in stats :) The distribution metric is
>> quite small.
>>
>> Tdigests are made up of just a series of centroids and weights (so two
> floats), and grow to a fixed size number of centroids. Somewhere between
> 100-1000 generally. That only ends up being a few dozen KB at most, so it
> should be very small. In [2], it mentions that the centroids are 40 bytes
> so 1000 centroids would only be about 40kb in disk/memory. This seems small
> to me but size is relative and it's much bigger than the current
> distribution metric of course. I'm not sure what impacts a 40kb metric
> might have.
>
> (I admit that I am also no expert on tdigests myself and have just read
> docs/explanations and tried some benchmarks :)
>
> [2]
> https://artifacts.elastic.co/javadoc/org/elasticsearch/elasticsearch-tdigest/8.11.3/org.elasticsearch.tdigest/org/elasticsearch/tdigest/MergingDigest.html
>
>
>
>> What do you think about exposing this as a new metric type (vs extending
>> distribution)?
>>
>
> This would be the easiest and probably safest route, I'm just concerned
> about the growing number of metrics we have, especially metrics that are
> very similar. I think it's currently confusing as a user whether they
> should choose int64 distribution, double distribution, int histogram, or
> now tdigestdistribution. They're all really just describing distributions
> and if the information included in a tdigest could come for "free" (as in,
> with an unobservably small cost in performance/memory/disk), then it'd make
> sense to include it whenever a user adds a distribution metric
>
>
>
>>
>> Thanks,
>> Danny
>>
>> [1]
>> https://github.com/apache/beam/blob/45c36901859bd5448f7b9d34ca47083feb14d713/sdks/java/extensions/sketching/src/main/java/org/apache/beam/sdk/extensions/sketching/TDigestQuantiles.java#L313
>> [2]
>> https://artifacts.elastic.co/javadoc/org/elasticsearch/elasticsearch-tdigest/8.11.3/org.elasticsearch.tdigest/org/elasticsearch/tdigest/MergingDigest.html
>>
>> On Fri, Oct 3, 2025 at 9:11 AM Joey Tran <[email protected]> wrote:
>>
>>> Just want to bump this in case anyone has any feedback.
>>>
>>> On Wed, Oct 1, 2025 at 9:56 AM Joey Tran <[email protected]> wrote:
>>>
>>>> Hey all,
>>>>
>>>> Our distribution metric is fairly coarse in how it describes a
>>>> distribution with just min, max, sum, mean, and count. There is currently
>>>> very little information on the actual shape of the distribution. I'd like
>>>> to propose a couple of improvements.
>>>>
>>>> As a small improvement, I think it would be nice to include stdev. As a
>>>> large improvement, we could use tdigests to create a more granular
>>>> distribution. This shouldn't be too hard to support since we already have a
>>>> TDigest[1] java transform  which we can probably adapt for the java sdk
>>>> harness; and we can use the fastdigest[2] python library for extending the
>>>> python SDK harness
>>>>
>>>> I propose that we extend the current encoding [3] of the distribution
>>>> metrics. The current encoding is:
>>>>     // Encoding: <count><sum><min><max>
>>>>     //   - count: beam:coder:varint:v1
>>>>     //   - sum:   beam:coder:double:v1
>>>>     //   - min:   beam:coder:double:v1
>>>>     //   - max:   beam:coder:double:v1
>>>>
>>>> I suggest just appending to it an encoding of a new `TDigestData` proto
>>>> that'd include information on the tdigest centroids and options (e.g. max
>>>> centroids). This can be an optional field so we wouldn't need to update all
>>>> SDKs at once. Similarly, runners will have the option to ignore the tdigest
>>>> option (which by default they will currently). The benefit of this is that
>>>> we don't need to expand the user metrics API - users will just get better
>>>> distribution information as their sdks/runners implement tdigest support.
>>>>
>>>> Looking for any thoughts or feedback on this idea.
>>>>
>>>> Cheers,
>>>> Joey
>>>>
>>>>
>>>> [1]
>>>> https://beam.apache.org/releases/javadoc/2.5.0/index.html?org/apache/beam/sdk/extensions/sketching/TDigestQuantiles.html
>>>> [2]
>>>> https://beam.apache.org/releases/javadoc/2.5.0/index.html?org/apache/beam/sdk/extensions/sketching/TDigestQuantiles.html
>>>> [3]
>>>> https://github.com/apache/beam/blob/75866588752de0c47136fde173944bd57c323401/model/pipeline/src/main/proto/org/apache/beam/model/pipeline/v1/metrics.proto#L534
>>>>
>>>

Reply via email to