Re: About the Gauge metric API

Reuven Lax Sat, 07 Apr 2018 12:31:15 -0700

Interesting point. Some runners might implement metrics via a control
plane, in which case per-key stuff is problematic for large numbers of
keys. However other runners may decide to implement metrics inside the
graph itself (e.g. by generating a CombinePerKey), in which case per-key
aggregation is less scary. Limits will be runner dependent.


On Fri, Apr 6, 2018 at 12:51 PM Robert Bradshaw <[email protected]> wrote:

> On Fri, Apr 6, 2018 at 12:23 PM Ben Chambers <[email protected]> wrote:
>
>> Generally strong +1 to everything Bill said. I would suggest though that
>> the per-worker segmentation might be specified using some more general
>> tagging/labeling API. For instance, all of the following seem like
>> reasonable uses to support:
>>
>> 1. Gauge that is tagged with worker to get per-worker segmentation (such
>> as queue size, memory usage, etc.)
>> 2. Gauge that is tagged with the "key" being processed. Would be useful
>> for things like how much data is buffered, where are watermark holds for
>> each key, etc. If processing is partitioned by key, this is strictly more
>> granular than per-worker.
>> 3. Counter tagged with the "key" being processed. Would be useful for
>> time spent processing each key, etc.
>>
>
> Per-key stuff gets really dangerous, as then the counter (control) plane
> has O(#keys) items to keep track of. That is unless it is paired with some
> kind of a lossy top/histogram aggregation.
>
> However, I think Bill hits the nail on the head that there is an implicit
> (ill-defined, in the model at least) segmentation desired here, with
> different aggregations happening within vs. across segments. (Also, FWIW,
> clobber is not the only aggregation that makes sense at the lowest level.)
> Default counters use the same aggregation across both levels, giving useful
> and well-defined semantics regardless of bundling and work distribution
> (assuming associative aggregation of course), but perhaps the
> counter/metrics APIs could be augmented to be able to explicitly request
> the level and differing aggregation with respect to this segmentation.
>
>
>> On Fri, Apr 6, 2018 at 11:39 AM Bill Neubauer <[email protected]> wrote:
>>
>>> Thanks for unraveling those themes, Pablo!
>>>
>>> 1. Seems reasonable in light of behaviors metrics backends support.
>>> 2. Those same backends support histogramming of data, so having integer
>>> types is very useful.
>>> 3. I believe that is the case, for the reasons I mentioned earlier,
>>> Gauges should only clobber previously values reported by the same entity.
>>> Two workers with the same gauge should not be overwriting each other's
>>> values, only their own. This implies per-worker segmentation.
>>>
>>>
>>> On Fri, Apr 6, 2018 at 11:35 AM Pablo Estrada <[email protected]>
>>> wrote:
>>>
>>>> Nobody wants to get rid of Gauges. I see that we have three separate
>>>> themes being discussed here, and I think it's useful to point them out and
>>>> address them independently:
>>>>
>>>> 1. Whether Gauges should change to hold string values.
>>>> 2. If Gauges are to support string values, whether Gauges should also
>>>> continue to have an int API.
>>>> 3. Whether Beam should support some sort of label / tag / worker-id for
>>>> aggregation of Gauges (maybe other metrics?)
>>>>
>>>> -P.
>>>>
>>>> On Fri, Apr 6, 2018 at 11:21 AM Ben Chambers <[email protected]>
>>>> wrote:
>>>>
>>>>> Gauges are incredibly useful for exposing the current state of the
>>>>> system. For instance, number of elements in a queue, current memory usage,
>>>>> number of RPCs in flight, etc. As mentioned above, these concepts exist in
>>>>> numerous systems for monitoring distributed environments, including
>>>>> Stackdriver Monitoring. The key to making them work is the addition of
>>>>> labels or tags, which as an aside are also useful for *all* metric types,
>>>>> not just Gauges.
>>>>>
>>>>> If Beam gets rid of Gauges, how would we go about exporting "current"
>>>>> values like memory usage, RPCs in flight, etc.?
>>>>>
>>>>> -- Ben
>>>>>
>>>>> On Fri, Apr 6, 2018 at 11:13 AM Kenneth Knowles <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> Just naively - the use cases that Gauge addresses seem relevant, and
>>>>>> the information seems feasible to gather and present. The bit that 
>>>>>> doesn't
>>>>>> seem to make sense is aggregating gauges by clobbering each other. So I
>>>>>> think that's just +1 Ben?
>>>>>>
>>>>>> On Fri, Apr 6, 2018 at 10:26 AM Raghu Angadi <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>> I am not opposed to removing other data types, though they are extra
>>>>>>> convenience for user.
>>>>>>>
>>>>>>> In Scott's example above, if the metric is a counter, what are the
>>>>>>> guarantees provided? E.g. would it match the global count using GBK? If
>>>>>>> yes, then gauges (especially per-key gauges) can be very useful too 
>>>>>>> (e.g.
>>>>>>> backlog for each Kafka partition/split).
>>>>>>>
>>>>>>> On Fri, Apr 6, 2018 at 10:01 AM Robert Bradshaw <[email protected]>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> A String API makes it clear(er) that the values will not be
>>>>>>>> aggregated in any way across workers. I don't think retaining both APIs
>>>>>>>> (except for possibly some short migration period) worthwhile. On 
>>>>>>>> another
>>>>>>>> note, I still find the distributed gague API to be a bit odd in 
>>>>>>>> general.
>>>>>>>>
>>>>>>>> On Fri, Apr 6, 2018 at 9:46 AM Raghu Angadi <[email protected]>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> I would be in favor of replacing the existing Gauge.set(long) API
>>>>>>>>>> with the String version and removing the old one. This would be a 
>>>>>>>>>> breaking
>>>>>>>>>> change. However this is a relatively new API and is still marked
>>>>>>>>>> @Experimental. Keeping the old API would retain the potential 
>>>>>>>>>> confusion.
>>>>>>>>>> It's better to simplify the API surface: having two APIs makes it 
>>>>>>>>>> less
>>>>>>>>>> clear which one users should choose.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Supporting additional data types sounds good. But the above states
>>>>>>>>> string API will replace the existing API. I do not see how string API 
>>>>>>>>> makes
>>>>>>>>> the semantics more clear.  Semantically both are same to the user.
>>>>>>>>>
>>>>>>>>> On Fri, Apr 6, 2018 at 9:31 AM Pablo Estrada <[email protected]>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Hi Ben : D
>>>>>>>>>>
>>>>>>>>>> Sure, that's reasonable. And perhaps I started the discussion in
>>>>>>>>>> the wrong direction. I'm not questioning the utility of Gauge 
>>>>>>>>>> metrics.
>>>>>>>>>>
>>>>>>>>>> What I'm saying is that Beam only supports integers,, but Gauges
>>>>>>>>>> are aggregated by dropping old values depending on their update 
>>>>>>>>>> times; so
>>>>>>>>>> it might be desirable to not restrict the data type to just integers.
>>>>>>>>>>
>>>>>>>>>> -P.
>>>>>>>>>>
>>>>>>>>>> On Fri, Apr 6, 2018 at 9:19 AM Ben Chambers <[email protected]>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> See for instance how gauge metrics are handled in Prometheus,
>>>>>>>>>>> Datadog and Stackdriver monitoring. Gauges are perfect for use in
>>>>>>>>>>> distributed systems, they just need to be properly labeled. Perhaps 
>>>>>>>>>>> we
>>>>>>>>>>> should apply a default tag or allow users to specify one.
>>>>>>>>>>>
>>>>>>>>>>> On Fri, Apr 6, 2018, 9:14 AM Ben Chambers <[email protected]>
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Some metrics backend label the value, for instance with the
>>>>>>>>>>>> worker that sent it. Then the aggregation is latest per label. 
>>>>>>>>>>>> This makes
>>>>>>>>>>>> it useful for holding values such as "memory usage" that need to 
>>>>>>>>>>>> hold
>>>>>>>>>>>> current value.
>>>>>>>>>>>>
>>>>>>>>>>>> On Fri, Apr 6, 2018, 9:00 AM Scott Wegner <[email protected]>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> +1 on the proposal to support a "String" gauge.
>>>>>>>>>>>>>
>>>>>>>>>>>>> To expand a bit, the current API doesn't make it clear that
>>>>>>>>>>>>> the gauge value is based on local state. If a runner chooses to 
>>>>>>>>>>>>> parallelize
>>>>>>>>>>>>> a DoFn across many workers, each worker will have its own local 
>>>>>>>>>>>>> Gauge
>>>>>>>>>>>>> metric and its updates will overwrite other values. For example, 
>>>>>>>>>>>>> from the
>>>>>>>>>>>>> API it looks like you could use a gauge to implement your own 
>>>>>>>>>>>>> element count
>>>>>>>>>>>>> metric:
>>>>>>>>>>>>>
>>>>>>>>>>>>> long count = 0;
>>>>>>>>>>>>> @ProcessElement
>>>>>>>>>>>>> public void processElement(ProcessContext c) {
>>>>>>>>>>>>>   myGauge.set(++count);
>>>>>>>>>>>>>   c.output(c.element());
>>>>>>>>>>>>> }
>>>>>>>>>>>>>
>>>>>>>>>>>>> This looks correct, but each worker has their own local
>>>>>>>>>>>>> 'count' field, and gauge metric updates from parallel workers will
>>>>>>>>>>>>> overwrite each other rather than get aggregated. So the final 
>>>>>>>>>>>>> value would
>>>>>>>>>>>>> be "the number of elements processed on one of the workers". (The 
>>>>>>>>>>>>> correct
>>>>>>>>>>>>> implementation uses a Counter metric).
>>>>>>>>>>>>>
>>>>>>>>>>>>> I would be in favor of replacing the existing Gauge.set(long)
>>>>>>>>>>>>> API with the String version and removing the old one. This would 
>>>>>>>>>>>>> be a
>>>>>>>>>>>>> breaking change. However this is a relatively new API and is 
>>>>>>>>>>>>> still marked
>>>>>>>>>>>>> @Experimental. Keeping the old API would retain the potential 
>>>>>>>>>>>>> confusion.
>>>>>>>>>>>>> It's better to simplify the API surface: having two APIs makes it 
>>>>>>>>>>>>> less
>>>>>>>>>>>>> clear which one users should choose.
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Fri, Apr 6, 2018 at 8:28 AM Pablo Estrada <
>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hello all,
>>>>>>>>>>>>>> As I was working on adding support for Gauges in Dataflow,
>>>>>>>>>>>>>> some noted that Gauge is a fairly unusual kind of metric for a 
>>>>>>>>>>>>>> distributed
>>>>>>>>>>>>>> environment, since many workers will report different values and 
>>>>>>>>>>>>>> stomp on
>>>>>>>>>>>>>> each other's all the time.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> We also looked at Flink and Dropwizard Gauge metrics [1][2],
>>>>>>>>>>>>>> and we found that these use generics, and Flink explicitly 
>>>>>>>>>>>>>> mentions that a
>>>>>>>>>>>>>> toString implementation is required[3].
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> With that in mind, I'm thinking that it might make sense to
>>>>>>>>>>>>>> 1) expand Gauge to support string values (keep int-based API for 
>>>>>>>>>>>>>> backwards
>>>>>>>>>>>>>> compatibility), and migrate it to use string behind the covers.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> What does everyone think about this?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Best
>>>>>>>>>>>>>> -P.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> 1 -
>>>>>>>>>>>>>> https://ci.apache.org/projects/flink/flink-docs-release-1.3/monitoring/metrics.html#metric-types
>>>>>>>>>>>>>> 2 - https://metrics.dropwizard.io/3.1.0/manual/core/#gauges
>>>>>>>>>>>>>> 3 -
>>>>>>>>>>>>>> https://github.com/apache/flink/blob/master/docs/monitoring/metrics.md#gauge
>>>>>>>>>>>>>> JIRA issue for Gauge metrics -
>>>>>>>>>>>>>> https://issues.apache.org/jira/browse/BEAM-1616
>>>>>>>>>>>>>> --
>>>>>>>>>>>>>> Got feedback? go/pabloem-feedback
>>>>>>>>>>>>>> <https://goto.google.com/pabloem-feedback>
>>>>>>>>>>>>>>
>>>>>>>>>>>>> --
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> Got feedback? http://go/swegner-feedback
>>>>>>>>>>>>>
>>>>>>>>>>>> --
>>>>>>>>>> Got feedback? go/pabloem-feedback
>>>>>>>>>> <https://goto.google.com/pabloem-feedback>
>>>>>>>>>>
>>>>>>>>> --
>>>> Got feedback? go/pabloem-feedback
>>>> <https://goto.google.com/pabloem-feedback>
>>>>
>>>

Re: About the Gauge metric API

Reply via email to