It has been a while for me, so I'm not sure how metrics have evolved since earlier discussions, but this discussion makes me want to at least add historical context.
The intention behind metrics is to decouple (1) reporting, (2) storage, and
(3) querying. The Distribution metric initially had this design
( https://github.com/apache/beam/commit/8524ed9545f5af4bdeb54601f333549b34eb35aa#diff-09adcf69a2518e5ba471a09203f3f0758e092038e4c1330d5b3e5951917422d1 ):

(1) reporting longs
(2) the runner stores a summary however it likes
(3) the ability to query a few key stats, with the intention to expand

The method `update(sum, count, min, max)` was introduced in
https://github.com/apache/beam/pull/7183. There are a few design docs around
it (https://s.apache.org/beam-fn-api-metrics, https://s.apache.org/get-metrics-api,
https://s.apache.org/beam-histogram-metrics). I don't know if it was an
explicit decision or not, but these designs, and the Fn API, started to
treat all of (1), (2), and (3) the same, making a metric just data +
combiner. In a Fn API world, item (1) needs to include a high-throughput
proto format, but I don't recall, and cannot find, a rationale for the
change. I actually think the lack of a clear canonical original design doc
is probably the culprit. I don't think we knew we were doing a total
overhaul of the concept.

I can't tell at a glance whether we have completely lost the ability to get
back to the more flexible and abstract design or not. I think something
powerful is lost otherwise. But on the other hand, a simpler model of "the
reporting, storage, and query are the same" is easy to understand, amenable
to runner-independent implementation, and might align well with other
frameworks, which is sort of the most important thing. The less Beam has to
do here before plugging in to a more specialized system, the better. I'm
not even sure querying via Beam makes sense.

Kenn

On Fri, Oct 3, 2025 at 3:29 PM Danny McCormick via dev <[email protected]> wrote:

> > grow to a fixed size number of centroids. Somewhere between 100-1000
> > generally.
>
> Thanks, I was missing the upper bound here. I think this is reasonable
> then, so no concerns from me. I'm generally aligned with the goal of not
> cluttering the metrics space if we don't need to.
>
> On Fri, Oct 3, 2025 at 2:47 PM Joey Tran <[email protected]> wrote:
>
>> On Fri, Oct 3, 2025 at 10:47 AM Danny McCormick via dev <
>> [email protected]> wrote:
>>
>>> Thanks - I had a few thoughts:
>>>
>>> 1) Exposing this data somehow seems helpful. I will note that we could
>>> get the (estimated) standard deviation from the histogram metric today
>>> without any modifications to the data stored by the metric.
>>>
>>> 2) I'm a bit concerned that this would blow up the size of the
>>> distribution metric, and I wonder if this would make sense as an
>>> independent metric. Presumably we'd need something like `MergingDigest`
>>> [1]. If I read the docs right, this could get quite large over time [2],
>>> but I'm only somewhat proficient in stats :) The distribution metric is
>>> quite small.
>>>
>> T-digests are made up of just a series of centroids and weights (so two
>> floats per centroid), and grow to a fixed size number of centroids.
>> Somewhere between 100-1000 generally. That only ends up being a few
>> dozen KB at most, so it should be very small. In [2], it mentions that
>> the centroids are 40 bytes, so 1000 centroids would only be about 40 KB
>> on disk / in memory. This seems small to me, but size is relative, and
>> it's much bigger than the current distribution metric, of course. I'm
>> not sure what impact a 40 KB metric might have.
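>>
>> As a quick back-of-envelope check (a self-contained sketch, not a
>> measurement; the 40-byte figure is the per-centroid in-memory estimate
>> from the MergingDigest javadoc [2], and the wire estimate assumes each
>> centroid serializes as two 8-byte doubles):
>>
>> BYTES_PER_CENTROID_IN_MEMORY = 40  # per the MergingDigest javadoc [2]
>> BYTES_PER_CENTROID_ON_WIRE = 16    # (mean, weight) as two doubles
>>
>> for n in (100, 1000):
>>     mem_kb = n * BYTES_PER_CENTROID_IN_MEMORY / 1024
>>     wire_kb = n * BYTES_PER_CENTROID_ON_WIRE / 1024
>>     print(f"{n} centroids: ~{mem_kb:.0f} KB in memory, ~{wire_kb:.0f} KB encoded")
>>
>> # 100 centroids: ~4 KB in memory, ~2 KB encoded
>> # 1000 centroids: ~39 KB in memory, ~16 KB encoded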
>> (I admit that I am no expert on t-digests myself and have just read
>> docs/explanations and tried some benchmarks :)
>>
>> [2]
>> https://artifacts.elastic.co/javadoc/org/elasticsearch/elasticsearch-tdigest/8.11.3/org.elasticsearch.tdigest/org/elasticsearch/tdigest/MergingDigest.html
>>
>>> What do you think about exposing this as a new metric type (vs
>>> extending distribution)?
>>
>> This would be the easiest and probably safest route; I'm just concerned
>> about the growing number of metrics we have, especially metrics that are
>> very similar. I think it's already confusing as a user whether to choose
>> the int64 distribution, double distribution, int histogram, or now a
>> t-digest distribution. They're all really just describing distributions,
>> and if the information included in a t-digest could come for "free" (as
>> in, with an unobservably small cost in performance/memory/disk), then it
>> would make sense to include it whenever a user adds a distribution
>> metric.
>>
>>> Thanks,
>>> Danny
>>>
>>> [1]
>>> https://github.com/apache/beam/blob/45c36901859bd5448f7b9d34ca47083feb14d713/sdks/java/extensions/sketching/src/main/java/org/apache/beam/sdk/extensions/sketching/TDigestQuantiles.java#L313
>>> [2]
>>> https://artifacts.elastic.co/javadoc/org/elasticsearch/elasticsearch-tdigest/8.11.3/org.elasticsearch.tdigest/org/elasticsearch/tdigest/MergingDigest.html
>>>
>>> On Fri, Oct 3, 2025 at 9:11 AM Joey Tran <[email protected]> wrote:
>>>
>>>> Just want to bump this in case anyone has any feedback.
>>>>
>>>> On Wed, Oct 1, 2025 at 9:56 AM Joey Tran <[email protected]> wrote:
>>>>
>>>>> Hey all,
>>>>>
>>>>> Our distribution metric is fairly coarse in how it describes a
>>>>> distribution, with just min, max, sum, mean, and count. There is
>>>>> currently very little information on the actual shape of the
>>>>> distribution. I'd like to propose a couple of improvements.
>>>>>
>>>>> As a small improvement, I think it would be nice to include stdev. As
>>>>> a larger improvement, we could use t-digests to create a more
>>>>> granular distribution. This shouldn't be too hard to support, since
>>>>> we already have a TDigest [1] Java transform which we can probably
>>>>> adapt for the Java SDK harness, and we can use the fastdigest [2]
>>>>> Python library to extend the Python SDK harness.
>>>>>
>>>>> I propose that we extend the current encoding [3] of the distribution
>>>>> metrics. The current encoding is:
>>>>>
>>>>> // Encoding: <count><sum><min><max>
>>>>> //   - count: beam:coder:varint:v1
>>>>> //   - sum: beam:coder:double:v1
>>>>> //   - min: beam:coder:double:v1
>>>>> //   - max: beam:coder:double:v1
>>>>>
>>>>> I suggest just appending to it an encoding of a new `TDigestData`
>>>>> proto that would include information on the t-digest centroids and
>>>>> options (e.g. max centroids). This can be an optional field, so we
>>>>> wouldn't need to update all SDKs at once. Similarly, runners will
>>>>> have the option to ignore the t-digest data (which, by default, they
>>>>> currently will). The benefit of this is that we don't need to expand
>>>>> the user metrics API - users will just get better distribution
>>>>> information as their SDKs/runners implement t-digest support.
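>>>>>
>>>>> To make the "optional tail" shape concrete, here's a minimal Python
>>>>> sketch (illustrative only: the field layout and names are assumptions
>>>>> on my part, not a final `TDigestData` definition, and the
>>>>> varint/double helpers just stand in for the real beam:coder:varint:v1
>>>>> and beam:coder:double:v1 coders):
>>>>>
>>>>> import struct
>>>>>
>>>>> def encode_varint(n: int) -> bytes:
>>>>>     # LEB128-style unsigned varint, a stand-in for beam:coder:varint:v1
>>>>>     out = bytearray()
>>>>>     while True:
>>>>>         byte = n & 0x7F
>>>>>         n >>= 7
>>>>>         out.append(byte | (0x80 if n else 0))
>>>>>         if not n:
>>>>>             return bytes(out)
>>>>>
>>>>> def encode_distribution(count, total, mn, mx, centroids=None,
>>>>>                         max_centroids=100):
>>>>>     # Existing payload: <count><sum><min><max>
>>>>>     payload = encode_varint(count) + struct.pack(">ddd", total, mn, mx)
>>>>>     if centroids is not None:
>>>>>         # Optional tail (the hypothetical TDigestData): readers that
>>>>>         # stop after <max> still decode exactly what they do today.
>>>>>         payload += encode_varint(max_centroids)
>>>>>         payload += encode_varint(len(centroids))
>>>>>         for mean, weight in centroids:
>>>>>             payload += struct.pack(">dd", mean, weight)
>>>>>     return payload
>>>>>
>>>>> # Old readers see a valid prefix; new readers also get the digest.
>>>>> old = encode_distribution(10, 55.0, 1.0, 10.0)
>>>>> new = encode_distribution(10, 55.0, 1.0, 10.0,
>>>>>                           centroids=[(3.2, 6.0), (8.5, 4.0)])
>>>>> assert new.startswith(old)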
>>>>>
>>>>> Looking for any thoughts or feedback on this idea.
>>>>>
>>>>> Cheers,
>>>>> Joey
>>>>>
>>>>> [1]
>>>>> https://beam.apache.org/releases/javadoc/2.5.0/index.html?org/apache/beam/sdk/extensions/sketching/TDigestQuantiles.html
>>>>> [2]
>>>>> https://beam.apache.org/releases/javadoc/2.5.0/index.html?org/apache/beam/sdk/extensions/sketching/TDigestQuantiles.html
>>>>> [3]
>>>>> https://github.com/apache/beam/blob/75866588752de0c47136fde173944bd57c323401/model/pipeline/src/main/proto/org/apache/beam/model/pipeline/v1/metrics.proto#L534
