Re: [DISCUSS] FLIP-179: Expose Standardized Operator Metrics

Arvid Heise Thu, 22 Jul 2021 00:36:49 -0700

Hi all,

@Steven Wu <stevenz...@gmail.com>


> Regarding "lastFetchTime" latency metric, I found Gauge to be less
> informative as it only captures the last sampling value for each metric
> publish interval (e.g. 60s).
> * Can we make it a histogram? Histograms are more expensive though.
> * Timer [1, 2] is cheaper as it just tracks min, max, avg, count. but there
> is no such metric type in Flink
> * Summary metric type [3] (from Prometheus) would be nice too
>
I'd also think that a histogram is much more expressive but the original
FLIP-33 decided against it because of it's cost. @Chesnay Schepler
<ches...@apache.org> could you shed some light on how much more expensive
it is in comparison to a simple gauge? Does it depend on whether a reporter
is actually using the metric?
The current interface of this FLIP-179 would actually allow to switch the
type of the metric later. But since the metric type is user-facing, we need
to have an agreement now.

@Becket Qin <becket....@gmail.com>

> In that case, do we still need the metric here? It seems we are creating a
> "global variable" which users may potentially use. I am wondering how much
> additional convenience it provides because it seems easy for people to
> simply pass the fetch time by themselves if they have decided to not use
> SourceReaderBase. Also, it looks like we do not have an API pattern that
> lets users get the value of a metric and derive another metric. So I think
> it is easier for people to understand if LastFetchTimeGauge() is just an
> independent metric by itself, instead of being a part of the
> eventTimeFetchLag computation.
>
I'm not sure if I follow the global variable argument, could you elaborate?
Are you referring specifically to the SettableGauge? How is that different
from a Counter or Meter?

With the current design, we could very well add a LastFetchTime metric. The
key point of the current abstraction is that a user gets the much harder
eventTimeFetchLag metric for free, since we already need to extract the
event time for other metrics. I think the JavaDoc makes it clear what the
intent of the LastFetchTimeGauge is and if not we can improve it.
Btw we have derived metrics already. For example, we have Meters for
byteIn/Out and recordIn/Out. That's already part of FLIP-33.

Would it make sense to have a more generic metadata type <T> associated
> with the records batch? In some cases, it may be useful to allow the Source
> implementation to carry some additional information of the batch to the
> RecordEmitter. For example, the split info of the batch, the sender of the
> batch etc. Because the RecordEmitter only takes one record at.a time,
> currently such information needs to be put into each record, which may
> involve a lot of wrapper object creation.
>
I like the idea of having more general metadata and I follow the example.
I'm wondering if we could avoid a generic type (since that adds a bit of
complexity to the mental model and usage) by simply encouraging to use a
more specific MetaData subclass as a return type of the method.

public interface RecordsWithSplitIds<E> {
    @Nullable
    default RecordMetadata getMetadata() {
        return null;
    }
    ...
}

public interface RecordMetadata {
    long getLastFetchTime(); // mandatory?
}

And using it as

public class KafkaRecordMetadata implements RecordMetadata {}

private static class KafkaPartitionSplitRecords<T> implements
RecordsWithSplitIds<T> {
    @Override
    public KafkaRecordMetadata getMetadata() {
        return metadata;
    }
}

Or do we want to have the generic to explicitly pass it to the
RecordEmitter? Would that metadata be a fourth parameter of
RecordEmitter#emitRecord?

It might be slightly better if we let the method accept a Supplier in this
> case. However, it seems to introduce a parallel channel or a sidepath
> between the user implementation and SourceOutput. I am not sure if this is
> the right way to go. Would it be more intuitive if we just add a new method
> to the SourceOutput, to allow the FetchTime to be passed in explicitly?
> This would work well with the change I suggested above, which adds a
> generic metadata type <T> to the RecordsWithSplits and passes that to the
> RecordEmitter.emitRecord() as an argument.
>

We could do that. That would remove the gauge from the MetricGroup, right?
The main downside is that sources that do not use SourceReaderBase cannot
set the metric anymore. So I'd rather keep the current way and extend it
with the metadata extension.

Best,

Arvid


On Wed, Jul 21, 2021 at 1:38 PM Becket Qin <becket....@gmail.com> wrote:

> Hey Chesnay,
>
> I think I got what that method was designed for now. Basically the
> motivation is to let the SourceOutput to report the eventTimeFetchLag for
> users. At this point, the SourceOutput only has the EventTime, so this
> method provides a way for the users to pass the FetchTime to the
> SourceOutput. This is essentially a context associated with each record
> emitted to the SourceOutput.
>
> It might be slightly better if we let the method accept a Supplier in this
> case. However, it seems to introduce a parallel channel or a sidepath
> between the user implementation and SourceOutput. I am not sure if this is
> the right way to go. Would it be more intuitive if we just add a new method
> to the SourceOutput, to allow the FetchTime to be passed in explicitly?
> This would work well with the change I suggested above, which adds a
> generic metadata type <T> to the RecordsWithSplits and passes that to the
> RecordEmitter.emitRecord() as an argument.
>
> What do you think?
>
> Thanks,
>
> Jiangjie (Becket) Qin
>
> On Tue, Jul 20, 2021 at 2:50 PM Chesnay Schepler <ches...@apache.org>
> wrote:
>
> > Would it be easier to understand if the method would accept a Supplier
> > instead?
> >
> > On 20/07/2021 05:36, Becket Qin wrote:
> > > In that case, do we still need the metric here? It seems we are
> creating
> > a
> > > "global variable" which users may potentially use. I am wondering how
> > much
> > > additional convenience it provides because it seems easy for people to
> > > simply pass the fetch time by themselves if they have decided to not
> use
> > > SourceReaderBase. Also, it looks like we do not have an API pattern
> that
> > > lets users get the value of a metric and derive another metric. So I
> > think
> > > it is easier for people to understand if LastFetchTimeGauge() is just
> an
> > > independent metric by itself, instead of being a part of the
> > > eventTimeFetchLag computation.
> >
> >
> >
>

Re: [DISCUSS] FLIP-179: Expose Standardized Operator Metrics

Reply via email to