Re: [DISCUSS] FLIP-33: Standardize connector metrics

Becket Qin Thu, 21 Feb 2019 07:28:11 -0800

Hi Chesnay,

Thanks for the comments. I think this is worthy of a FLIP because it is
public API. According to the FLIP description a FlIP is required in case of:

   - Any change that impacts the public interfaces of the project

and the following entry is found in the definition of "public interface".

   - Exposed monitoring information

Metrics are critical to any production system. So a clear metric definition
is important for any serious users. For an organization with large Flink
installation, change in metrics means great amount of work. So such changes
do need to be fully discussed and documented.

** Re: Histogram.
We can discuss whether there is a better way to expose metrics that are
suitable for histograms. My micro-benchmark on various histogram
implementations also indicates that they are significantly slower than
other metric types. But I don't think that means never use histogram, but
means use it with caution. For example, we can suggest the implementations
to turn them off by default and only turn it on for a small amount of time
when performing some micro-debugging.

** Re: complication:
Connector conventions are essential for Flink ecosystem. Flink connectors
pool is probably the most important part of Flink, just like any other data
system. Clear conventions of connectors will help build Flink ecosystem in
a more organic way.
Take the metrics convention as an example, imagine someone has developed a
Flink connector for System foo, and another developer may have developed a
monitoring and diagnostic framework for Flink which analyzes the Flink job
performance based on metrics. With a clear metric convention, those two
projects could be developed independently.  Once users put them together,
it would work without additional modifications. This cannot be easily
achieved by just defining a few constants.

** Re: selective metrics:
Sure, we can discuss that in a separate thread.

@Dawid

** Re: latency / fetchedLatency
The primary purpose of establish such a convention is to help developers
write connectors in a more compatible way. The convention is supposed to be
defined more proactively. So when look at the convention, it seems more
important to see if the concept is applicable to connectors in general. It
might be true so far only Kafka connector reports latency. But there might
be hundreds of other connector implementations in the Flink ecosystem,
though not in the Flink repo, and some of them also emits latency. I think
a lot of other sources actually also has an append timestamp. e.g. database
bin logs and some K-V stores. So I wouldn't be surprised if some database
connector can also emit latency metrics.

Thanks,

Jiangjie (Becket) Qin

On Thu, Feb 21, 2019 at 10:14 PM Chesnay Schepler <ches...@apache.org>
wrote:

> Regarding 2) It doesn't make sense to investigate this as part of this
> FLIP. This is something that could be of interest for the entire metric
> system, and should be designed for as such.
>
> Regarding the proposal as a whole:
>
> Histogram metrics shall not be added to the core of Flink. They are
> significantly more expensive than other metrics, and calculating
> histograms in the application is regarded as an anti-pattern by several
> metric backends, who instead recommend to expose the raw data and
> calculate the histogram in the backend.
>
> Second, this seems overly complicated. Given that we already established
> that not all connectors will export all metrics we are effectively
> reducing this down to a consistent naming scheme. We don't need anything
> sophisticated for that; basically just a few constants that all
> connectors use.
>
> I'm not convinced that this is worthy of a FLIP.
>
> On 21.02.2019 14:26, Dawid Wysakowicz wrote:
> > Hi,
> >
> > Ad 1. In general I undestand and I agree. But those particular metrics
> > (latency, fetchLatency), right now would only be reported if user uses
> > KafkaConsumer with internal timestampAssigner with StreamCharacteristic
> > set to EventTime, right? That sounds like a very specific case. I am not
> > sure if we should introduce a generic metric that will be
> > disabled/absent for most of implementations.
> >
> > Ad.2 That sounds like an orthogonal issue, that might make sense to
> > investigate in the future.
> >
> > Best,
> >
> > Dawid
> >
> > On 21/02/2019 13:20, Becket Qin wrote:
> >> Hi Dawid,
> >>
> >> Thanks for the feedback. That makes sense to me. There are two cases to
> be
> >> addressed.
> >>
> >> 1. The metrics are supposed to be a guidance. It is likely that a
> connector
> >> only supports some but not all of the metrics. In that case, each
> connector
> >> implementation should have the freedom to decide which metrics are
> >> reported. For the metrics that are supported, the guidance should be
> >> followed.
> >>
> >> 2. Sometimes users may want to disable certain metrics for some reason
> >> (e.g. performance / reprocessing of data). A generic mechanism should be
> >> provided to allow user choose which metrics are reported. This mechanism
> >> should also be honored by the connector implementations.
> >>
> >> Does this sound reasonable to you?
> >>
> >> Thanks,
> >>
> >> Jiangjie (Becket) Qin
> >>
> >>
> >>
> >> On Thu, Feb 21, 2019 at 4:22 PM Dawid Wysakowicz <
> dwysakow...@apache.org>
> >> wrote:
> >>
> >>> Hi,
> >>>
> >>> Generally I like the idea of having a unified, standard set of metrics
> for
> >>> all connectors. I have some slight concerns about fetchLatency and
> >>> latency though. They are computed based on EventTime which is not a
> purely
> >>> technical feature. It depends often on some business logic, might be
> absent
> >>> or defined after source. Those metrics could also behave in a weird
> way in
> >>> case of replaying backlog. Therefore I am not sure if we should include
> >>> those metrics by default. Maybe we could at least introduce a feature
> >>> switch for them? What do you think?
> >>>
> >>> Best,
> >>>
> >>> Dawid
> >>> On 21/02/2019 03:13, Becket Qin wrote:
> >>>
> >>> Bump. If there is no objections to the proposed metrics. I'll start a
> >>> voting thread later toady.
> >>>
> >>> Thanks,
> >>>
> >>> Jiangjie (Becket) Qin
> >>>
> >>> On Mon, Feb 11, 2019 at 8:17 PM Becket Qin <becket....@gmail.com> <
> becket....@gmail.com> wrote:
> >>>
> >>>
> >>> Hi folks,
> >>>
> >>> I would like to start the FLIP discussion thread about standardize the
> >>> connector metrics.
> >>>
> >>> In short, we would like to provide a convention of Flink connector
> >>> metrics. It will help simplify the monitoring and alerting on Flink
> jobs.
> >>> The FLIP link is following:
> >>>
> >>>
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-33%3A+Standardize+Connector+Metrics
> >>>
> >>> Thanks,
> >>>
> >>> Jiangjie (Becket) Qin
> >>>
> >>>
> >>>
>
>

Re: [DISCUSS] FLIP-33: Standardize connector metrics

Reply via email to