Hi all,

To complete the Source refactoring work, I'd like to revive this discussion. Since the mail thread has been dormant for more than a year, here is a quick recap of the motivation of the FLIP:
1. The FLIP proposes to standardize the connector metrics by giving guidance on the metric specifications, including the name, type and meaning of the metrics. 2. It is OK for a connector to not emit some of the metrics in the metric guidance, but if a metric of the same semantic is emitted, the implementation should follow the guidance. 3. It is OK for a connector to emit more metrics than what are listed in the FLIP. This includes having an alias for a metric specified in the FLIP. 4. We will implement some of the metrics out of the box in the default implementation of FLIP-27, as long as it is applicable. The FLIP wiki is following: https://cwiki.apache.org/confluence/display/FLINK/FLIP-33 %3A+Standardize+Connector+Metrics Any thoughts? Thanks, Jiangjie (Becket) Qin On Fri, Jun 14, 2019 at 2:29 PM Piotr Nowojski <pi...@ververica.com> wrote: > > we will need to revisit the convention list and adjust them accordingly > when FLIP-27 is ready > > > Yes, this sounds good :) > > Piotrek > > > On 13 Jun 2019, at 13:35, Becket Qin <becket....@gmail.com> wrote: > > > > Hi Piotr, > > > > That's great to know. Chances are that we will need to revisit the > > convention list and adjust them accordingly when FLIP-27 is ready, At > that > > point we can mark some of the metrics as available by default for > > connectors implementing the new interface. > > > > Thanks, > > > > Jiangjie (Becket) Qin > > > > On Thu, Jun 13, 2019 at 3:51 PM Piotr Nowojski <pi...@ververica.com> > wrote: > > > >> Thanks for driving this. I’ve just noticed one small thing. With new > >> SourceReader interface Flink will be able to provide `idleTime` metric > >> automatically. > >> > >> Piotrek > >> > >>> On 13 Jun 2019, at 03:30, Becket Qin <becket....@gmail.com> wrote: > >>> > >>> Thanks all for the feedback and discussion. > >>> > >>> Since there wasn't any concern raised, I've started the voting thread > for > >>> this FLIP, but please feel free to continue the discussion here if you > >>> think something still needs to be addressed. > >>> > >>> Thanks, > >>> > >>> Jiangjie (Becket) Qin > >>> > >>> > >>> > >>> On Mon, Jun 10, 2019 at 9:10 AM Becket Qin <becket....@gmail.com> > wrote: > >>> > >>>> Hi Piotr, > >>>> > >>>> Thanks for the comments. Yes, you are right. Users will have to look > at > >>>> other metrics to decide whether the pipeline is healthy or not in the > >> first > >>>> place before they can use the time-based metric to fix the bottleneck. > >>>> > >>>> I agree that once we have FLIP-27 ready, some of the metrics can just > be > >>>> reported by the abstract implementation. > >>>> > >>>> I've updated FLIP-33 wiki page to add the pendingBytes and > >> pendingRecords > >>>> metric. Please let me know if you have any concern over the updated > >> metric > >>>> convention proposal. > >>>> > >>>> @Chesnay Schepler <ches...@apache.org> @Stephan Ewen > >>>> <step...@ververica.com> will you also have time to take a look at the > >>>> proposed metric convention? If there is no further concern I'll start > a > >>>> voting thread for this FLIP in two days. > >>>> > >>>> Thanks, > >>>> > >>>> Jiangjie (Becket) Qin > >>>> > >>>> > >>>> > >>>> On Wed, Jun 5, 2019 at 6:54 PM Piotr Nowojski <pi...@ververica.com> > >> wrote: > >>>> > >>>>> Hi Becket, > >>>>> > >>>>> Thanks for the answer :) > >>>>> > >>>>>> By time-based metric, I meant the portion of time spent on producing > >> the > >>>>>> record to downstream. 
For example, a source connector can report > that > >>>>> it's > >>>>>> spending 80% of time to emit record to downstream processing > pipeline. > >>>>> In > >>>>>> another case, a sink connector may report that its spending 30% of > >> time > >>>>>> producing the records to the external system. > >>>>>> > >>>>>> This is in some sense equivalent to the buffer usage metric: > >>>>> > >>>>>> - 80% of time spent on emitting records to downstream ---> > downstream > >>>>>> node is bottleneck ---> output buffer is probably full. > >>>>>> - 30% of time spent on emitting records to downstream ---> > downstream > >>>>>> node is not bottleneck ---> output buffer is probably not full. > >>>>> > >>>>> If by “time spent on emitting records to downstream” you understand > >>>>> “waiting on back pressure”, then I see your point. And I agree that > >> some > >>>>> kind of ratio/time based metric gives you more information. However > >> under > >>>>> “time spent on emitting records to downstream” might be hidden the > >>>>> following (extreme) situation: > >>>>> > >>>>> 1. Job is barely able to handle influx of records, there is 99% > >>>>> CPU/resource usage in the cluster, but nobody is > >>>>> bottlenecked/backpressured, all output buffers are empty, everybody > is > >>>>> waiting in 1% of it’s time for more records to process. > >>>>> 2. 80% time can still be spent on "down stream operators”, because > they > >>>>> are the CPU intensive operations, but this doesn’t mean that > >> increasing the > >>>>> parallelism down the stream will help with anything there. To the > >> contrary, > >>>>> increasing parallelism of the source operator might help to increase > >>>>> resource utilisation up to 100%. > >>>>> > >>>>> However, this “time based/ratio” approach can be extended to > in/output > >>>>> buffer usage. Besides collecting an information that input/output > >> buffer is > >>>>> full/empty, we can probe profile how often are buffer empty/full. If > >> output > >>>>> buffer is full 1% of times, there is almost no back pressure. If it’s > >> full > >>>>> 80% of times, there is some back pressure, if it’s full 99.9% of > times, > >>>>> there is huge back pressure. > >>>>> > >>>>> Now for autoscaling you could compare the input & output buffers fill > >>>>> ratio: > >>>>> > >>>>> 1. Both are high, the source of bottleneck is down the stream > >>>>> 2. Output is low, input is high, this is the bottleneck and the > higher > >>>>> the difference, the bigger source of bottleneck is this is > >> operator/task > >>>>> 3. Output is high, input is low - there was some load spike that we > are > >>>>> currently finishing to process > >>>>> > >>>>> > >>>>> > >>>>> But long story short, we are probably diverging from the topic of > this > >>>>> discussion, and we can discuss this at some later point. > >>>>> > >>>>> For now, for sources: > >>>>> > >>>>> as I wrote before, +1 for: > >>>>> - pending.bytes, Gauge > >>>>> - pending.messages, Gauge > >>>>> > >>>>> When we will be developing/discussing SourceReader from FLIP-27 we > >> might > >>>>> then add: > >>>>> > >>>>> - in-memory.buffer.usage (0 - 100%) > >>>>> > >>>>> Which will be estimated automatically by Flink while user will be > able > >> to > >>>>> override/provide better estimation. > >>>>> > >>>>> Piotrek > >>>>> > >>>>>> On 5 Jun 2019, at 05:42, Becket Qin <becket....@gmail.com> wrote: > >>>>>> > >>>>>> Hi Piotr, > >>>>>> > >>>>>> Thanks for the explanation. Please see some clarifications below. 
> >>>>>> > >>>>>> By time-based metric, I meant the portion of time spent on producing > >> the > >>>>>> record to downstream. For example, a source connector can report > that > >>>>> it's > >>>>>> spending 80% of time to emit record to downstream processing > pipeline. > >>>>> In > >>>>>> another case, a sink connector may report that its spending 30% of > >> time > >>>>>> producing the records to the external system. > >>>>>> > >>>>>> This is in some sense equivalent to the buffer usage metric: > >>>>>> - 80% of time spent on emitting records to downstream ---> > downstream > >>>>>> node is bottleneck ---> output buffer is probably full. > >>>>>> - 30% of time spent on emitting records to downstream ---> > downstream > >>>>>> node is not bottleneck ---> output buffer is probably not full. > >>>>>> > >>>>>> However, the time-based metric has a few advantages that the buffer > >>>>> usage > >>>>>> metric may not have. > >>>>>> > >>>>>> 1. Buffer usage metric may not be applicable to all the connector > >>>>>> implementations, while reporting time-based metric are always > doable. > >>>>>> Some source connectors may not have any input buffer, or they may > use > >>>>> some > >>>>>> third party library that does not expose the input buffer at all. > >>>>>> Similarly, for sink connectors, the implementation may not have any > >>>>> output > >>>>>> buffer, or the third party library does not expose such buffer. > >>>>>> > >>>>>> 2. Although both type of metrics can detect bottleneck, time-based > >>>>> metrics > >>>>>> can be used to generate a more informed action to remove the > >> bottleneck. > >>>>>> For example, when the downstream is bottleneck, the output buffer > >> usage > >>>>>> metric is likely to be 100%, and the input buffer usage might be 0%. > >>>>> That > >>>>>> means we don't know what is the suitable parallelism to lift the > >>>>>> bottleneck. The time-based metric, on the other hand, would give > >> useful > >>>>>> information, e.g. if 80% of time was spent on emitting records, we > can > >>>>>> roughly increase the downstream node parallelism by 4 times. > >>>>>> > >>>>>> Admittedly, the time-based metrics are more expensive than buffer > >>>>> usage. So > >>>>>> we may have to do some sampling to reduce the cost. But in general, > >>>>> using > >>>>>> time-based metrics seems worth adding. > >>>>>> > >>>>>> That being said, I don't think buffer usage metric and time-based > >>>>> metrics > >>>>>> are mutually exclusive. We can probably have both. It is just that > in > >>>>>> practice, features like auto-scaling might prefer time-based metrics > >> for > >>>>>> the reason stated above. > >>>>>> > >>>>>>> 1. Define the metrics that would allow us to manually detect > >>>>> bottlenecks. > >>>>>> As I wrote, we already have them in most of the places, except of > >>>>>> sources/sinks. > >>>>>>> 2. Use those metrics, to automatically detect bottlenecks. > Currently > >> we > >>>>>> are only automatically detecting back pressure and reporting it to > the > >>>>> user > >>>>>> in web UI (is it exposed as a metric at all?). Detecting the root > >> cause > >>>>> of > >>>>>> the back pressure (bottleneck) is one step further. > >>>>>>> 3. Use the knowledge about where exactly the bottleneck is located, > >> to > >>>>>> try to do something with it. > >>>>>> > >>>>>> As explained above, I think time-based metric also addresses item 1 > >> and > >>>>>> item 2. > >>>>>> > >>>>>> Any thoughts? 
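For illustration only (nothing in the thread or the FLIP prescribes this): a minimal sketch of how a source could sample the emit-time ratio on every Nth loop iteration so the hot path stays cheap. The class, the metric name `emitTimeRatio`, the sampling interval, and the `fetchNextRecord()` hook are all invented placeholders.

```java
import org.apache.flink.configuration.Configuration;
import org.apache.flink.metrics.Gauge;
import org.apache.flink.streaming.api.functions.source.RichSourceFunction;

// Sketch only: probes every 100th iteration and reports which fraction of the
// probed time was spent emitting records downstream (vs. fetching them).
public abstract class TimeRatioSource extends RichSourceFunction<String> {

    private static final int SAMPLE_EVERY = 100;

    private volatile boolean running = true;
    private volatile double emitTimeRatio; // 0.0 .. 1.0, e.g. 0.8 ~ "80% of time spent emitting"

    @Override
    public void open(Configuration parameters) {
        getRuntimeContext().getMetricGroup()
                .gauge("emitTimeRatio", (Gauge<Double>) () -> emitTimeRatio);
    }

    @Override
    public void run(SourceContext<String> ctx) throws Exception {
        long count = 0;
        long fetchNanos = 0;
        long emitNanos = 0;
        while (running) {
            boolean probe = (++count % SAMPLE_EVERY == 0); // sample to keep per-record overhead low
            long t0 = probe ? System.nanoTime() : 0L;
            String record = fetchNextRecord();              // placeholder for the connector's I/O
            long t1 = probe ? System.nanoTime() : 0L;
            ctx.collect(record);                            // "emitting to the downstream pipeline"
            if (probe) {
                long t2 = System.nanoTime();
                fetchNanos += t1 - t0;
                emitNanos += t2 - t1;
                long total = fetchNanos + emitNanos;
                if (total > 0) {
                    // a ratio of 0.8 would suggest roughly 4x more downstream parallelism
                    emitTimeRatio = (double) emitNanos / total;
                }
            }
        }
    }

    @Override
    public void cancel() {
        running = false;
    }

    /** Hypothetical hook standing in for the actual fetch from the external system. */
    protected abstract String fetchNextRecord() throws Exception;
}
```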
> >>>>>> > >>>>>> Thanks, > >>>>>> > >>>>>> Jiangjie (Becket) Qin > >>>>>> > >>>>>> > >>>>>> > >>>>>> On Mon, Jun 3, 2019 at 4:14 PM Piotr Nowojski <pi...@ververica.com> > >>>>> wrote: > >>>>>> > >>>>>>> Hi again :) > >>>>>>> > >>>>>>>> - pending.bytes, Gauge > >>>>>>>> - pending.messages, Gauge > >>>>>>> > >>>>>>> > >>>>>>> +1 > >>>>>>> > >>>>>>> And true, instead of overloading one of the metric it is better > when > >>>>> user > >>>>>>> can choose to provide only one of them. > >>>>>>> > >>>>>>> Re 2: > >>>>>>> > >>>>>>>> If I understand correctly, this metric along with the pending > >> mesages > >>>>> / > >>>>>>>> bytes would answer the questions of: > >>>>>>> > >>>>>>>> - Does the connector consume fast enough? Lagging behind + empty > >>>>> buffer > >>>>>>> = > >>>>>>>> cannot consume fast enough. > >>>>>>>> - Does the connector emit fast enough? Lagging behind + full > buffer > >> = > >>>>>>>> cannot emit fast enough, i.e. the Flink pipeline is slow. > >>>>>>> > >>>>>>> Yes, exactly. This can also be used to support decisions like > >> changing > >>>>> the > >>>>>>> parallelism of the sources and/or down stream operators. > >>>>>>> > >>>>>>> I’m not sure if I understand your proposal with time based > >>>>> measurements. > >>>>>>> Maybe I’m missing something, but I do not see how measuring time > >> alone > >>>>>>> could answer the problem: where is the bottleneck. Time spent on > the > >>>>>>> next/emit might be short or long (depending on how heavy to process > >> the > >>>>>>> record is) and the source can still be bottlenecked/back pressured > or > >>>>> not. > >>>>>>> Usually the easiest and the most reliable way how to detect > >>>>> bottlenecks is > >>>>>>> by checking usage of input & output buffers, since when input > buffer > >> is > >>>>>>> full while output buffer is empty, that’s the definition of a > >>>>> bottleneck. > >>>>>>> Also this is usually very easy and cheap to measure (it works > >>>>> effectively > >>>>>>> the same way as current’s Flink back pressure monitoring, but more > >>>>> cleanly, > >>>>>>> without probing thread’s stack traces). > >>>>>>> > >>>>>>> Also keep in mind that we are already using the buffer usage > metrics > >>>>> for > >>>>>>> detecting the bottlenecks in Flink’s internal network exchanges > >> (manual > >>>>>>> work). That’s the reason why I wanted to extend this to > >> sources/sinks, > >>>>>>> since they are currently our blind spot. > >>>>>>> > >>>>>>>> One feature we are currently working on to scale Flink > automatically > >>>>>>> relies > >>>>>>>> on some metrics answering the same question > >>>>>>> > >>>>>>> That would be very helpful feature. I think in order to achieve > that > >> we > >>>>>>> would need to: > >>>>>>> 1. Define the metrics that would allow us to manually detect > >>>>> bottlenecks. > >>>>>>> As I wrote, we already have them in most of the places, except of > >>>>>>> sources/sinks. > >>>>>>> 2. Use those metrics, to automatically detect bottlenecks. > Currently > >> we > >>>>>>> are only automatically detecting back pressure and reporting it to > >> the > >>>>> user > >>>>>>> in web UI (is it exposed as a metric at all?). Detecting the root > >>>>> cause of > >>>>>>> the back pressure (bottleneck) is one step further. > >>>>>>> 3. Use the knowledge about where exactly the bottleneck is located, > >> to > >>>>> try > >>>>>>> to do something with it. > >>>>>>> > >>>>>>> I think you are aiming for point 3., but before we reach it, we are > >>>>> still > >>>>>>> missing 1. & 2. 
Also even if we have 3., there is a value in 1 & 2 > >> for > >>>>>>> manual analysis/dashboards. > >>>>>>> > >>>>>>> However, having the knowledge of where the bottleneck is, doesn’t > >>>>>>> necessarily mean that we know what we can do about it. For example > >>>>>>> increasing parallelism might or might not help with anything (data > >>>>> skew, > >>>>>>> bottleneck on some resource that does not scale), but this remark > >>>>> applies > >>>>>>> always, regardless of the way how did we detect the bottleneck. > >>>>>>> > >>>>>>> Piotrek > >>>>>>> > >>>>>>>> On 3 Jun 2019, at 06:16, Becket Qin <becket....@gmail.com> wrote: > >>>>>>>> > >>>>>>>> Hi Piotr, > >>>>>>>> > >>>>>>>> Thanks for the suggestion. Some thoughts below: > >>>>>>>> > >>>>>>>> Re 1: The pending messages / bytes. > >>>>>>>> I completely agree these are very useful metrics and we should > >> expect > >>>>> the > >>>>>>>> connector to report. WRT the way to expose them, it seems more > >>>>> consistent > >>>>>>>> to add two metrics instead of adding a method (unless there are > >> other > >>>>> use > >>>>>>>> cases other than metric reporting). So we can add the following > two > >>>>>>> metrics. > >>>>>>>> - pending.bytes, Gauge > >>>>>>>> - pending.messages, Gauge > >>>>>>>> Applicable connectors can choose to report them. These two metrics > >>>>> along > >>>>>>>> with latency should be sufficient for users to understand the > >> progress > >>>>>>> of a > >>>>>>>> connector. > >>>>>>>> > >>>>>>>> > >>>>>>>> Re 2: Number of buffered data in-memory of the connector > >>>>>>>> If I understand correctly, this metric along with the pending > >> mesages > >>>>> / > >>>>>>>> bytes would answer the questions of: > >>>>>>>> - Does the connector consume fast enough? Lagging behind + empty > >>>>> buffer > >>>>>>> = > >>>>>>>> cannot consume fast enough. > >>>>>>>> - Does the connector emit fast enough? Lagging behind + full > buffer > >> = > >>>>>>>> cannot emit fast enough, i.e. the Flink pipeline is slow. > >>>>>>>> > >>>>>>>> One feature we are currently working on to scale Flink > automatically > >>>>>>> relies > >>>>>>>> on some metrics answering the same question, more specifically, we > >> are > >>>>>>>> profiling the time spent on .next() method (time to consume) and > the > >>>>> time > >>>>>>>> spent on .collect() method (time to emit / process). One advantage > >> of > >>>>>>> such > >>>>>>>> method level time cost allows us to calculate the parallelism > >>>>> required to > >>>>>>>> keep up in case their is a lag. > >>>>>>>> > >>>>>>>> However, one concern I have regarding such metric is that they are > >>>>>>>> implementation specific. Either profiling on the method time, or > >>>>>>> reporting > >>>>>>>> buffer usage assumes the connector are implemented in such a way. > A > >>>>>>>> slightly better solution might be have the following metric: > >>>>>>>> > >>>>>>>> - EmitTimeRatio (or FetchTimeRatio): The time spent on emitting > >>>>>>>> records / Total time elapsed. > >>>>>>>> > >>>>>>>> This assumes that the source connectors have to emit the records > to > >>>>> the > >>>>>>>> downstream at some point. The emission may take some time ( e.g. > go > >>>>>>> through > >>>>>>>> chained operators). And the rest of the time are spent to prepare > >> the > >>>>>>>> record to emit, including time for consuming and format > conversion, > >>>>> etc. > >>>>>>>> Ideally, we'd like to see the time spent on record fetch and emit > to > >>>>> be > >>>>>>>> about the same, so no one is bottleneck for the other. 
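Purely as an illustration of the two gauges proposed under "Re 1" above (using the names proposed there): a connector might register them along these lines. How the backlog numbers are obtained is connector-specific, and the `update()` hook is invented for the example.

```java
import org.apache.flink.metrics.Gauge;
import org.apache.flink.metrics.MetricGroup;

import java.util.concurrent.atomic.AtomicLong;

// Sketch only: registers the proposed pending.bytes / pending.messages gauges.
// Where the numbers come from (consumer lag, unread file bytes, ...) is up to the connector.
public class PendingMetrics {

    private final AtomicLong pendingBytes = new AtomicLong();
    private final AtomicLong pendingMessages = new AtomicLong();

    public PendingMetrics(MetricGroup connectorGroup) {
        connectorGroup.gauge("pending.bytes", (Gauge<Long>) pendingBytes::get);
        connectorGroup.gauge("pending.messages", (Gauge<Long>) pendingMessages::get);
    }

    /** Hypothetical hook, called whenever the connector refreshes its view of the external backlog. */
    public void update(long bytes, long messages) {
        pendingBytes.set(bytes);
        pendingMessages.set(messages);
    }
}
```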
> >>>>>>>> > >>>>>>>> The downside of these time based metrics is additional overhead to > >> get > >>>>>>> the > >>>>>>>> time, therefore sampling might be needed. But in practice I feel > >> such > >>>>>>> time > >>>>>>>> based metric might be more useful if we want to take action. > >>>>>>>> > >>>>>>>> > >>>>>>>> I think we should absolutely add metrics in (1) to the metric > >>>>> convention. > >>>>>>>> We could also add the metrics mentioned in (2) if we reach > consensus > >>>>> on > >>>>>>>> that. What do you think? > >>>>>>>> > >>>>>>>> Thanks, > >>>>>>>> > >>>>>>>> Jiangjie (Becket) Qin > >>>>>>>> > >>>>>>>> > >>>>>>>> On Fri, May 31, 2019 at 4:26 PM Piotr Nowojski < > pi...@ververica.com > >>> > >>>>>>> wrote: > >>>>>>>> > >>>>>>>>> Hey Becket, > >>>>>>>>> > >>>>>>>>> Re 1a) and 1b) +1 from my side. > >>>>>>>>> > >>>>>>>>> I’ve discussed this issue: > >>>>>>>>> > >>>>>>>>>>>> > >>>>>>>>>>>> 2. It would be nice to have metrics, that allow us to check > the > >>>>> cause > >>>>>>>>> of > >>>>>>>>>>>> back pressure: > >>>>>>>>>>>> a) for sources, length of input queue (in bytes? Or boolean > >>>>>>>>>>>> hasSomethingl/isEmpty) > >>>>>>>>>>>> b) for sinks, length of output queue (in bytes? Or boolean > >>>>>>>>>>>> hasSomething/isEmpty) > >>>>>>>>> > >>>>>>>>> With Nico at some lengths and he also saw the benefits of them. > We > >>>>> also > >>>>>>>>> have more concrete proposal for that. > >>>>>>>>> > >>>>>>>>> Actually there are two really useful metrics, that we are missing > >>>>>>>>> currently: > >>>>>>>>> > >>>>>>>>> 1. Number of data/records/bytes in the backlog to process. For > >>>>> example > >>>>>>>>> remaining number of bytes in unread files. Or pending data in > Kafka > >>>>>>> topics. > >>>>>>>>> 2. Number of buffered data in-memory of the connector, that are > >>>>> waiting > >>>>>>> to > >>>>>>>>> be processed pushed to Flink pipeline. > >>>>>>>>> > >>>>>>>>> Re 1: > >>>>>>>>> This would have to be a metric provided directly by a connector. > It > >>>>>>> could > >>>>>>>>> be an undefined `int`: > >>>>>>>>> > >>>>>>>>> `int backlog` - estimate of pending work. > >>>>>>>>> > >>>>>>>>> “Undefined” meaning that it would be up to a connector to decided > >>>>>>> whether > >>>>>>>>> it’s measured in bytes, records, pending files or whatever it is > >>>>>>> possible > >>>>>>>>> to provide by the connector. This is because I assume not every > >>>>>>> connector > >>>>>>>>> can provide exact number and for some of them it might be > >> impossible > >>>>> to > >>>>>>>>> provide records number of bytes count. > >>>>>>>>> > >>>>>>>>> Re 2: > >>>>>>>>> This metric could be either provided by a connector, or > calculated > >>>>>>> crudely > >>>>>>>>> by Flink: > >>>>>>>>> > >>>>>>>>> `float bufferUsage` - value from [0.0, 1.0] range > >>>>>>>>> > >>>>>>>>> Percentage of used in memory buffers, like in Kafka’s handover. > >>>>>>>>> > >>>>>>>>> It could be crudely implemented by Flink with FLIP-27 > >>>>>>>>> SourceReader#isAvailable. If SourceReader is not available > reported > >>>>>>>>> `bufferUsage` could be 0.0. If it is available, it could be 1.0. > I > >>>>> think > >>>>>>>>> this would be a good enough estimation for most of the use cases > >>>>> (that > >>>>>>>>> could be overloaded and implemented better if desired). > Especially > >>>>>>> since we > >>>>>>>>> are reporting only probed values: if probed values are almost > >> always > >>>>>>> “1.0”, > >>>>>>>>> it would mean that we have a back pressure. 
If they are almost > >> always > >>>>>>>>> “0.0”, there is probably no back pressure at the sources. > >>>>>>>>> > >>>>>>>>> What do you think about this? > >>>>>>>>> > >>>>>>>>> Piotrek > >>>>>>>>> > >>>>>>>>>> On 30 May 2019, at 11:41, Becket Qin <becket....@gmail.com> > >> wrote: > >>>>>>>>>> > >>>>>>>>>> Hi all, > >>>>>>>>>> > >>>>>>>>>> Thanks a lot for all the feedback and comments. I'd like to > >> continue > >>>>>>> the > >>>>>>>>>> discussion on this FLIP. > >>>>>>>>>> > >>>>>>>>>> I updated the FLIP-33 wiki to remove all the histogram metrics > >> from > >>>>> the > >>>>>>>>>> first version of metric convention due to the performance > concern. > >>>>> The > >>>>>>>>> plan > >>>>>>>>>> is to introduce them later when we have a mechanism to opt > in/out > >>>>>>>>> metrics. > >>>>>>>>>> At that point, users can decide whether they want to pay the > cost > >> to > >>>>>>> get > >>>>>>>>>> the metric or not. > >>>>>>>>>> > >>>>>>>>>> As Stephan suggested, for this FLIP, let's first try to agree on > >> the > >>>>>>>>> small > >>>>>>>>>> list of conventional metrics that connectors should follow. > >>>>>>>>>> Just to be clear, the purpose of the convention is not to > enforce > >>>>> every > >>>>>>>>>> connector to report all these metrics, but to provide a guidance > >> in > >>>>>>> case > >>>>>>>>>> these metrics are reported by some connectors. > >>>>>>>>>> > >>>>>>>>>> > >>>>>>>>>> @ Stephan & Chesnay, > >>>>>>>>>> > >>>>>>>>>> Regarding the duplication of `RecordsIn` metric in operator / > task > >>>>>>>>>> IOMetricGroups, from what I understand, for source operator, it > is > >>>>>>>>> actually > >>>>>>>>>> the SourceFunction that reports the operator level > >>>>>>>>>> RecordsIn/RecordsInPerSecond metric. So they are essentially the > >>>>> same > >>>>>>>>>> metric in the operator level IOMetricGroup. Similarly for the > Sink > >>>>>>>>>> operator, the operator level RecordsOut/RecordsOutPerSecond > >> metrics > >>>>> are > >>>>>>>>>> also reported by the Sink function. I marked them as existing in > >> the > >>>>>>>>>> FLIP-33 wiki page. Please let me know if I misunderstood. > >>>>>>>>>> > >>>>>>>>>> Thanks, > >>>>>>>>>> > >>>>>>>>>> Jiangjie (Becket) Qin > >>>>>>>>>> > >>>>>>>>>> > >>>>>>>>>> On Thu, May 30, 2019 at 5:16 PM Becket Qin < > becket....@gmail.com> > >>>>>>> wrote: > >>>>>>>>>> > >>>>>>>>>>> Hi Piotr, > >>>>>>>>>>> > >>>>>>>>>>> Thanks a lot for the feedback. > >>>>>>>>>>> > >>>>>>>>>>> 1a) I guess you are referring to the part that "original system > >>>>>>> specific > >>>>>>>>>>> metrics should also be reported". The performance impact > depends > >> on > >>>>>>> the > >>>>>>>>>>> implementation. An efficient implementation would only record > the > >>>>>>> metric > >>>>>>>>>>> once, but report them with two different metric names. This is > >>>>>>> unlikely > >>>>>>>>> to > >>>>>>>>>>> hurt performance. > >>>>>>>>>>> > >>>>>>>>>>> 1b) Yes, I agree that we should avoid adding overhead to the > >>>>> critical > >>>>>>>>> path > >>>>>>>>>>> by all means. This is sometimes a tradeoff, running blindly > >> without > >>>>>>> any > >>>>>>>>>>> metric gives best performance, but sometimes might be > frustrating > >>>>> when > >>>>>>>>> we > >>>>>>>>>>> debug some issues. > >>>>>>>>>>> > >>>>>>>>>>> 2. The metrics are indeed very useful. Are they supposed to be > >>>>>>> reported > >>>>>>>>> by > >>>>>>>>>>> the connectors or Flink itself? 
At this point FLIP-33 is more > >>>>> focused > >>>>>>> on > >>>>>>>>>>> provide a guidance to the connector authors on the metrics > >>>>> reporting. > >>>>>>>>> That > >>>>>>>>>>> said, after FLIP-27, I think we should absolutely report these > >>>>> metrics > >>>>>>>>> in > >>>>>>>>>>> the abstract implementation. In any case, the metric convention > >> in > >>>>>>> this > >>>>>>>>>>> list are expected to evolve over time. > >>>>>>>>>>> > >>>>>>>>>>> Thanks, > >>>>>>>>>>> > >>>>>>>>>>> Jiangjie (Becket) Qin > >>>>>>>>>>> > >>>>>>>>>>> On Tue, May 28, 2019 at 6:24 PM Piotr Nowojski < > >>>>> pi...@ververica.com> > >>>>>>>>>>> wrote: > >>>>>>>>>>> > >>>>>>>>>>>> Hi, > >>>>>>>>>>>> > >>>>>>>>>>>> Thanks for the proposal and driving the effort here Becket :) > >> I’ve > >>>>>>> read > >>>>>>>>>>>> through the FLIP-33 [1], and here are couple of my thoughts. > >>>>>>>>>>>> > >>>>>>>>>>>> Big +1 for standardising the metric names between connectors, > it > >>>>> will > >>>>>>>>>>>> definitely help us and users a lot. > >>>>>>>>>>>> > >>>>>>>>>>>> Issues/questions/things to discuss that I’ve thought of: > >>>>>>>>>>>> > >>>>>>>>>>>> 1a. If we are about to duplicate some metrics, can this > become a > >>>>>>>>>>>> performance issue? > >>>>>>>>>>>> 1b. Generally speaking, we should make sure that collecting > >> those > >>>>>>>>> metrics > >>>>>>>>>>>> is as non intrusive as possible, especially that they will > need > >>>>> to be > >>>>>>>>>>>> updated once per record. (They might be collected more rarely > >> with > >>>>>>> some > >>>>>>>>>>>> overhead, but the hot path of updating it per record will need > >> to > >>>>> be > >>>>>>> as > >>>>>>>>>>>> quick as possible). That includes both avoiding heavy > >> computation > >>>>> on > >>>>>>>>> per > >>>>>>>>>>>> record path: histograms?, measuring time for time based > metrics > >>>>> (per > >>>>>>>>>>>> second) (System.currentTimeMillis() depending on the > >>>>> implementation > >>>>>>> can > >>>>>>>>>>>> invoke a system call) > >>>>>>>>>>>> > >>>>>>>>>>>> 2. It would be nice to have metrics, that allow us to check > the > >>>>> cause > >>>>>>>>> of > >>>>>>>>>>>> back pressure: > >>>>>>>>>>>> a) for sources, length of input queue (in bytes? Or boolean > >>>>>>>>>>>> hasSomethingl/isEmpty) > >>>>>>>>>>>> b) for sinks, length of output queue (in bytes? Or boolean > >>>>>>>>>>>> hasSomething/isEmpty) > >>>>>>>>>>>> > >>>>>>>>>>>> a) is useful in a scenario when we are processing backlog of > >>>>> records, > >>>>>>>>> all > >>>>>>>>>>>> of the internal Flink’s input/output network buffers are > empty, > >>>>> and > >>>>>>> we > >>>>>>>>> want > >>>>>>>>>>>> to check whether the external source system is the bottleneck > >>>>>>> (source’s > >>>>>>>>>>>> input queue will be empty), or if the Flink’s connector is the > >>>>>>>>> bottleneck > >>>>>>>>>>>> (source’s input queues will be full). > >>>>>>>>>>>> b) similar story. Backlog of records, but this time all of the > >>>>>>> internal > >>>>>>>>>>>> Flink’s input/ouput network buffers are full, and we want o > >> check > >>>>>>>>> whether > >>>>>>>>>>>> the external sink system is the bottleneck (sink output queues > >> are > >>>>>>>>> full), > >>>>>>>>>>>> or if the Flink’s connector is the bottleneck (sink’s output > >>>>> queues > >>>>>>>>> will be > >>>>>>>>>>>> empty) > >>>>>>>>>>>> > >>>>>>>>>>>> It might be sometimes difficult to provide those metrics, so > >> they > >>>>>>> could > >>>>>>>>>>>> be optional, but if we could provide them, it would be really > >>>>>>> helpful. 
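To make scenario b) above a bit more concrete, here is a sketch of a buffering sink exposing the state of its output queue; the queue, its capacity, and the metric names are invented for the example and not part of the proposal.

```java
import org.apache.flink.configuration.Configuration;
import org.apache.flink.metrics.Gauge;
import org.apache.flink.streaming.api.functions.sink.RichSinkFunction;

import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Sketch only: a sink that buffers records before writing them out and exposes how
// full that buffer is. A full queue points at the external system as the bottleneck;
// an empty queue while the job is back-pressured points at the connector / pipeline.
public abstract class BufferingSink<T> extends RichSinkFunction<T> {

    private static final int CAPACITY = 10_000;

    private final BlockingQueue<T> outputQueue = new ArrayBlockingQueue<>(CAPACITY);

    @Override
    public void open(Configuration parameters) {
        getRuntimeContext().getMetricGroup()
                .gauge("outputQueueLength", (Gauge<Integer>) outputQueue::size);
        getRuntimeContext().getMetricGroup()
                .gauge("outputQueueUsage", (Gauge<Double>) () -> outputQueue.size() / (double) CAPACITY);
    }

    @Override
    public void invoke(T value, Context context) throws Exception {
        outputQueue.put(value); // blocks (and back-pressures) when the external writer cannot keep up
    }

    // A background writer draining outputQueue into the external system is omitted;
    // only the metric wiring matters for this sketch.
}
```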
> >>>>>>>>>>>> > >>>>>>>>>>>> Piotrek > >>>>>>>>>>>> > >>>>>>>>>>>> [1] > >>>>>>>>>>>> > >>>>>>>>> > >>>>>>> > >>>>> > >> > https://cwiki.apache.org/confluence/display/FLINK/FLIP-33:+Standardize+Connector+Metrics > >>>>>>>>>>>> < > >>>>>>>>>>>> > >>>>>>>>> > >>>>>>> > >>>>> > >> > https://cwiki.apache.org/confluence/display/FLINK/FLIP-33:+Standardize+Connector+Metrics > >>>>>>>>>>>>> > >>>>>>>>>>>> > >>>>>>>>>>>>> On 24 Apr 2019, at 13:28, Stephan Ewen <se...@apache.org> > >> wrote: > >>>>>>>>>>>>> > >>>>>>>>>>>>> I think this sounds reasonable. > >>>>>>>>>>>>> > >>>>>>>>>>>>> Let's keep the "reconfiguration without stopping the job" out > >> of > >>>>>>> this, > >>>>>>>>>>>>> because that would be a super big effort and if we approach > >> that, > >>>>>>> then > >>>>>>>>>>>> in > >>>>>>>>>>>>> more generic way rather than specific to connector metrics. > >>>>>>>>>>>>> > >>>>>>>>>>>>> I would suggest to look at the following things before > starting > >>>>> with > >>>>>>>>> any > >>>>>>>>>>>>> implementation work: > >>>>>>>>>>>>> > >>>>>>>>>>>>> - Try and find a committer to support this, otherwise it will > >> be > >>>>>>> hard > >>>>>>>>>>>> to > >>>>>>>>>>>>> make progress > >>>>>>>>>>>>> - Start with defining a smaller set of "core metrics" and > >> extend > >>>>> the > >>>>>>>>>>>> set > >>>>>>>>>>>>> later. I think that is easier than now blocking on reaching > >>>>>>> consensus > >>>>>>>>>>>> on a > >>>>>>>>>>>>> large group of metrics. > >>>>>>>>>>>>> - Find a solution to the problem Chesnay mentioned, that the > >>>>>>> "records > >>>>>>>>>>>> in" > >>>>>>>>>>>>> metric is somehow overloaded and exists already in the IO > >> Metric > >>>>>>>>> group. > >>>>>>>>>>>>> > >>>>>>>>>>>>> > >>>>>>>>>>>>> On Mon, Mar 25, 2019 at 7:16 AM Becket Qin < > >> becket....@gmail.com > >>>>>> > >>>>>>>>>>>> wrote: > >>>>>>>>>>>>> > >>>>>>>>>>>>>> Hi Stephan, > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> Thanks a lot for the feedback. All makes sense. > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> It is a good suggestion to simply have an onRecord(numBytes, > >>>>>>>>> eventTime) > >>>>>>>>>>>>>> method for connector writers. It should meet most of the > >>>>>>>>> requirements, > >>>>>>>>>>>>>> individual > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> The configurable metrics feature is something really useful, > >>>>>>>>>>>> especially if > >>>>>>>>>>>>>> we can somehow make it dynamically configurable without > >> stopping > >>>>>>> the > >>>>>>>>>>>> jobs. > >>>>>>>>>>>>>> It might be better to make it a separate discussion because > it > >>>>> is a > >>>>>>>>>>>> more > >>>>>>>>>>>>>> generic feature instead of only for connectors. > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> So in order to make some progress, in this FLIP we can limit > >> the > >>>>>>>>>>>> discussion > >>>>>>>>>>>>>> scope to the connector related items: > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> - the standard connector metric names and types. > >>>>>>>>>>>>>> - the abstract ConnectorMetricHandler interface > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> I'll start a separate thread to discuss other general metric > >>>>>>> related > >>>>>>>>>>>>>> enhancement items including: > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> - optional metrics > >>>>>>>>>>>>>> - dynamic metric configuration > >>>>>>>>>>>>>> - potential combination with rate limiter > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> Does this plan sound reasonable? 
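The ConnectorMetricHandler mentioned above is not spelled out anywhere in this thread, so the following is only a hypothetical sketch of the "single onRecord(numBytes, eventTime) entry point" idea; the class name, the metric names, and the chosen set of metrics are all invented here.

```java
import org.apache.flink.metrics.Counter;
import org.apache.flink.metrics.Meter;
import org.apache.flink.metrics.MeterView;
import org.apache.flink.metrics.MetricGroup;

// Hypothetical sketch: the connector only calls onRecord(...) on its hot path;
// which metrics actually get registered stays inside the handler.
public class ConnectorMetricHandlerSketch {

    private final Counter numRecordsIn;
    private final Counter numBytesIn;
    private final Meter numBytesInPerSecond;

    public ConnectorMetricHandlerSketch(MetricGroup operatorMetricGroup, String connectorName) {
        MetricGroup group = operatorMetricGroup.addGroup(connectorName);
        this.numRecordsIn = group.counter("numRecordsIn");
        this.numBytesIn = group.counter("numBytesIn");
        this.numBytesInPerSecond = group.meter("numBytesInPerSecond", new MeterView(numBytesIn));
    }

    /** The single call a connector makes per record. */
    public void onRecord(long numBytes, long eventTime) {
        numRecordsIn.inc();
        numBytesIn.inc(numBytes);
        // eventTime could feed a fetch-latency gauge; left out here to keep the hot path cheap.
    }
}
```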
> >>>>>>>>>>>>>> > >>>>>>>>>>>>>> Thanks, > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> Jiangjie (Becket) Qin > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> On Sat, Mar 23, 2019 at 5:53 AM Stephan Ewen < > >> se...@apache.org> > >>>>>>>>> wrote: > >>>>>>>>>>>>>> > >>>>>>>>>>>>>>> Ignoring for a moment implementation details, this > connector > >>>>>>> metrics > >>>>>>>>>>>> work > >>>>>>>>>>>>>>> is a really good thing to do, in my opinion > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> The questions "oh, my job seems to be doing nothing, I am > >>>>> looking > >>>>>>> at > >>>>>>>>>>>> the > >>>>>>>>>>>>>> UI > >>>>>>>>>>>>>>> and the 'records in' value is still zero" is in the top > three > >>>>>>>>> support > >>>>>>>>>>>>>>> questions I have been asked personally. > >>>>>>>>>>>>>>> Introspection into "how far is the consumer lagging behind" > >>>>> (event > >>>>>>>>>>>> time > >>>>>>>>>>>>>>> fetch latency) came up many times as well. > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> So big +1 to solving this problem. > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> About the exact design - I would try to go for the > following > >>>>>>>>>>>> properties: > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> - keep complexity of of connectors. Ideally the metrics > >> handler > >>>>>>> has > >>>>>>>>> a > >>>>>>>>>>>>>>> single onRecord(numBytes, eventTime) method or so, and > >>>>> everything > >>>>>>>>>>>> else is > >>>>>>>>>>>>>>> internal to the handler. That makes it dead simple for the > >>>>>>>>> connector. > >>>>>>>>>>>> We > >>>>>>>>>>>>>>> can also think of an extensive scheme for connector > specific > >>>>>>>>> metrics. > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> - make it configurable on the job it cluster level which > >>>>> metrics > >>>>>>> the > >>>>>>>>>>>>>>> handler internally creates when that method is invoked. > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> What do you think? > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> Best, > >>>>>>>>>>>>>>> Stephan > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> On Thu, Mar 21, 2019 at 10:42 AM Chesnay Schepler < > >>>>>>>>> ches...@apache.org > >>>>>>>>>>>>> > >>>>>>>>>>>>>>> wrote: > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>> As I said before, I believe this to be over-engineered and > >>>>> have > >>>>>>> no > >>>>>>>>>>>>>>>> interest in this implementation. > >>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>> There are conceptual issues like defining a duplicate > >>>>>>>>>>>>>> numBytesIn(PerSec) > >>>>>>>>>>>>>>>> metric that already exists for each operator. > >>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>> On 21.03.2019 06:13, Becket Qin wrote: > >>>>>>>>>>>>>>>>> A few updates to the thread. I uploaded a patch[1] as a > >>>>> complete > >>>>>>>>>>>>>>>>> example of how users can use the metrics in the future. > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> Some thoughts below after taking a look at the > >>>>>>> AbstractMetricGroup > >>>>>>>>>>>>>> and > >>>>>>>>>>>>>>>>> its subclasses. > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> This patch intends to provide convenience for Flink > >> connector > >>>>>>>>>>>>>>>>> implementations to follow metrics standards proposed in > >>>>> FLIP-33. > >>>>>>>>> It > >>>>>>>>>>>>>>>>> also try to enhance the metric management in general way > to > >>>>> help > >>>>>>>>>>>>>> users > >>>>>>>>>>>>>>>>> with: > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> 1. metric definition > >>>>>>>>>>>>>>>>> 2. metric dependencies check > >>>>>>>>>>>>>>>>> 3. metric validation > >>>>>>>>>>>>>>>>> 4. 
metric control (turn on / off particular metrics) > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> This patch wraps |MetricGroup| to extend the > functionality > >> of > >>>>>>>>>>>>>>>>> |AbstractMetricGroup| and its subclasses. The > >>>>>>>>>>>>>>>>> |AbstractMetricGroup| mainly focus on the metric group > >>>>>>> hierarchy, > >>>>>>>>>>>> but > >>>>>>>>>>>>>>>>> does not really manage the metrics other than keeping > them > >>>>> in a > >>>>>>>>> Map. > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> Ideally we should only have one entry point for the > >> metrics. > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> Right now the entry point is |AbstractMetricGroup|. > >> However, > >>>>>>>>> besides > >>>>>>>>>>>>>>>>> the missing functionality mentioned above, > >>>>> |AbstractMetricGroup| > >>>>>>>>>>>>>> seems > >>>>>>>>>>>>>>>>> deeply rooted in Flink runtime. We could extract it out > to > >>>>>>>>>>>>>>>>> flink-metrics in order to use it for generic purpose. > There > >>>>> will > >>>>>>>>> be > >>>>>>>>>>>>>>>>> some work, though. > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> Another approach is to make |AbstractMetrics| in this > patch > >>>>> as > >>>>>>> the > >>>>>>>>>>>>>>>>> metric entry point. It wraps metric group and provides > the > >>>>>>> missing > >>>>>>>>>>>>>>>>> functionalities. Then we can roll out this pattern to > >> runtime > >>>>>>>>>>>>>>>>> components gradually as well. > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> My first thought is that the latter approach gives a more > >>>>> smooth > >>>>>>>>>>>>>>>>> migration. But I am also OK with doing a refactoring on > the > >>>>>>>>>>>>>>>>> |AbstractMetricGroup| family. > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> Thanks, > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> Jiangjie (Becket) Qin > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> [1] https://github.com/becketqin/flink/pull/1 > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> On Mon, Feb 25, 2019 at 2:32 PM Becket Qin < > >>>>>>> becket....@gmail.com > >>>>>>>>>>>>>>>>> <mailto:becket....@gmail.com>> wrote: > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> Hi Chesnay, > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> It might be easier to discuss some implementation details > >> in > >>>>>>> the > >>>>>>>>>>>>>>>>> PR review instead of in the FLIP discussion thread. I > have > >> a > >>>>>>>>>>>>>> patch > >>>>>>>>>>>>>>>>> for Kafka connectors ready but haven't submitted the PR > >> yet. > >>>>>>>>>>>>>>>>> Hopefully that will help explain a bit more. > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> ** Re: metric type binding > >>>>>>>>>>>>>>>>> This is a valid point that worths discussing. If I > >> understand > >>>>>>>>>>>>>>>>> correctly, there are two points: > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> 1. Metric type / interface does not matter as long as the > >>>>>>> metric > >>>>>>>>>>>>>>>>> semantic is clearly defined. > >>>>>>>>>>>>>>>>> Conceptually speaking, I agree that as long as the metric > >>>>>>>>>>>>>> semantic > >>>>>>>>>>>>>>>>> is defined, metric type does not matter. To some extent, > >>>>> Gauge > >>>>>>> / > >>>>>>>>>>>>>>>>> Counter / Meter / Histogram themselves can be think of as > >>>>> some > >>>>>>>>>>>>>>>>> well-recognized semantics, if you wish. In Flink, these > >>>>> metric > >>>>>>>>>>>>>>>>> semantics have their associated interface classes. In > >>>>> practice, > >>>>>>>>>>>>>>>>> such semantic to interface binding seems necessary for > >>>>>>> different > >>>>>>>>>>>>>>>>> components to communicate. 
Simply standardize the > semantic > >>>>> of > >>>>>>>>>>>>>> the > >>>>>>>>>>>>>>>>> connector metrics seems not sufficient for people to > build > >>>>>>>>>>>>>>>>> ecosystem on top of. At the end of the day, we still need > >> to > >>>>>>>>> have > >>>>>>>>>>>>>>>>> some embodiment of the metric semantics that people can > >>>>> program > >>>>>>>>>>>>>>>>> against. > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> 2. Sometimes the same metric semantic can be exposed > using > >>>>>>>>>>>>>>>>> different metric types / interfaces. > >>>>>>>>>>>>>>>>> This is a good point. Counter and Gauge-as-a-Counter are > >>>>> pretty > >>>>>>>>>>>>>>>>> much interchangeable. This is more of a trade-off between > >> the > >>>>>>>>>>>>>> user > >>>>>>>>>>>>>>>>> experience of metric producers and consumers. The metric > >>>>>>>>>>>>>> producers > >>>>>>>>>>>>>>>>> want to use Counter or Gauge depending on whether the > >> counter > >>>>>>> is > >>>>>>>>>>>>>>>>> already tracked in code, while ideally the metric > consumers > >>>>>>> only > >>>>>>>>>>>>>>>>> want to see a single metric type for each metric. I am > >>>>> leaning > >>>>>>>>>>>>>>>>> towards to make the metric producers happy, i.e. allow > >> Gauge > >>>>> / > >>>>>>>>>>>>>>>>> Counter metric type, and the the metric consumers handle > >> the > >>>>>>>>> type > >>>>>>>>>>>>>>>>> variation. The reason is that in practice, there might be > >>>>> more > >>>>>>>>>>>>>>>>> connector implementations than metric reporter > >>>>> implementations. > >>>>>>>>>>>>>> We > >>>>>>>>>>>>>>>>> could also provide some helper method to facilitate > reading > >>>>>>> from > >>>>>>>>>>>>>>>>> such variable metric type. > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> Just some quick replies to the comments around > >> implementation > >>>>>>>>>>>>>>>> details. > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> 4) single place where metrics are registered except > >>>>>>>>>>>>>>>>> connector-specific > >>>>>>>>>>>>>>>>> ones (which we can't really avoid). > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> Register connector specific ones in a single place is > >>>>> actually > >>>>>>>>>>>>>>>>> something that I want to achieve. > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> 2) I'm talking about time-series databases like > >>>>> Prometheus. > >>>>>>>>>>>>>> We > >>>>>>>>>>>>>>>>> would > >>>>>>>>>>>>>>>>> only have a gauge metric exposing the last > >>>>>>>>> fetchTime/emitTime > >>>>>>>>>>>>>>>>> that is > >>>>>>>>>>>>>>>>> regularly reported to the backend (Prometheus), where a > >>>>>>> user > >>>>>>>>>>>>>>>>> could build > >>>>>>>>>>>>>>>>> a histogram of his choosing when/if he wants it. > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> Not sure if such downsampling works. As an example, if a > >> user > >>>>>>>>>>>>>>>>> complains that there are some intermittent latency spikes > >>>>>>> (maybe > >>>>>>>>>>>>>> a > >>>>>>>>>>>>>>>>> few records in 10 seconds) in their processing system. > >>>>> Having a > >>>>>>>>>>>>>>>>> Gauge sampling instantaneous latency seems unlikely > useful. > >>>>>>>>>>>>>>>>> However by looking at actual 99.9 percentile latency > might > >>>>>>> help. > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> Thanks, > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> Jiangjie (Becket) Qin > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> On Fri, Feb 22, 2019 at 9:30 PM Chesnay Schepler > >>>>>>>>>>>>>>>>> <ches...@apache.org <mailto:ches...@apache.org>> wrote: > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> Re: over complication of implementation. 
> >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> I think I get understand better know what you're > >> shooting > >>>>>>>>>>>>>> for, > >>>>>>>>>>>>>>>>> effectively something like the OperatorIOMetricGroup. > >>>>>>>>>>>>>>>>> But still, re-define setupConnectorMetrics() to accept > a > >>>>>>> set > >>>>>>>>>>>>>>>>> of flags > >>>>>>>>>>>>>>>>> for counters/meters(ans _possibly_ histograms) along > >>>>> with a > >>>>>>>>>>>>>>>>> set of > >>>>>>>>>>>>>>>>> well-defined Optional<Gauge<?>>, and return the group. > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> Solves all issues as far as i can tell: > >>>>>>>>>>>>>>>>> 1) no metrics must be created manually (except Gauges, > >>>>>>> which > >>>>>>>>>>>>>>> are > >>>>>>>>>>>>>>>>> effectively just Suppliers and you can't get around > >>>>> this), > >>>>>>>>>>>>>>>>> 2) additional metrics can be registered on the returned > >>>>>>>>>>>>>> group, > >>>>>>>>>>>>>>>>> 3) see 1), > >>>>>>>>>>>>>>>>> 4) single place where metrics are registered except > >>>>>>>>>>>>>>>>> connector-specific > >>>>>>>>>>>>>>>>> ones (which we can't really avoid). > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> Re: Histogram > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> 1) As an example, whether "numRecordsIn" is exposed as > a > >>>>>>>>>>>>>>>>> Counter or a > >>>>>>>>>>>>>>>>> Gauge should be irrelevant. So far we're using the > >> metric > >>>>>>>>>>>>>> type > >>>>>>>>>>>>>>>>> that is > >>>>>>>>>>>>>>>>> the most convenient at exposing a given value. If there > >>>>> is > >>>>>>>>>>>>>>>>> some backing > >>>>>>>>>>>>>>>>> data-structure that we want to expose some data from we > >>>>>>>>>>>>>>>>> typically opt > >>>>>>>>>>>>>>>>> for a Gauge, as otherwise we're just mucking around > with > >>>>>>> the > >>>>>>>>>>>>>>>>> Meter/Counter API to get it to match. Similarly, if we > >>>>> want > >>>>>>>>>>>>>> to > >>>>>>>>>>>>>>>>> count > >>>>>>>>>>>>>>>>> something but no current count exists we typically > added > >>>>> a > >>>>>>>>>>>>>>>>> Counter. > >>>>>>>>>>>>>>>>> That's why attaching semantics to metric types makes > >>>>> little > >>>>>>>>>>>>>>>>> sense (but > >>>>>>>>>>>>>>>>> unfortunately several reporters already do it); for > >>>>>>>>>>>>>>>>> counters/meters > >>>>>>>>>>>>>>>>> certainly, but the majority of metrics are gauges. > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> 2) I'm talking about time-series databases like > >>>>> Prometheus. > >>>>>>>>>>>>>> We > >>>>>>>>>>>>>>>>> would > >>>>>>>>>>>>>>>>> only have a gauge metric exposing the last > >>>>>>>>> fetchTime/emitTime > >>>>>>>>>>>>>>>>> that is > >>>>>>>>>>>>>>>>> regularly reported to the backend (Prometheus), where a > >>>>>>> user > >>>>>>>>>>>>>>>>> could build > >>>>>>>>>>>>>>>>> a histogram of his choosing when/if he wants it. > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> On 22.02.2019 13:57, Becket Qin wrote: > >>>>>>>>>>>>>>>>>> Hi Chesnay, > >>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>> Thanks for the explanation. > >>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>> ** Re: FLIP > >>>>>>>>>>>>>>>>>> I might have misunderstood this, but it seems that > "major > >>>>>>>>>>>>>>>>> changes" are well > >>>>>>>>>>>>>>>>>> defined in FLIP. The full contents is following: > >>>>>>>>>>>>>>>>>> What is considered a "major change" that needs a FLIP? 
> >>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>> Any of the following should be considered a major > change: > >>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>> - Any major new feature, subsystem, or piece of > >>>>>>>>>>>>>>>>> functionality > >>>>>>>>>>>>>>>>>> - *Any change that impacts the public interfaces of the > >>>>>>>>>>>>>>>>> project* > >>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>> What are the "public interfaces" of the project? > >>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>> *All of the following are public interfaces *that people > >>>>>>>>>>>>>>>>> build around: > >>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>> - DataStream and DataSet API, including classes related > >>>>>>>>>>>>>>>>> to that, such as > >>>>>>>>>>>>>>>>>> StreamExecutionEnvironment > >>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>> - Classes marked with the @Public annotation > >>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>> - On-disk binary formats, such as > >>>>>>>>>>>>>> checkpoints/savepoints > >>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>> - User-facing scripts/command-line tools, i.e. > >>>>>>>>>>>>>>>>> bin/flink, Yarn scripts, > >>>>>>>>>>>>>>>>>> Mesos scripts > >>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>> - Configuration settings > >>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>> - *Exposed monitoring information* > >>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>> So any monitoring information change is considered as > >>>>>>>>>>>>>> public > >>>>>>>>>>>>>>>>> interface, and > >>>>>>>>>>>>>>>>>> any public interface change is considered as a "major > >>>>>>>>>>>>>>> change". > >>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>> ** Re: over complication of implementation. > >>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>> Although this is more of implementation details that is > >> not > >>>>>>>>>>>>>>>>> covered by the > >>>>>>>>>>>>>>>>>> FLIP. But it may be worth discussing. > >>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>> First of all, I completely agree that we should use the > >>>>>>>>>>>>>>>>> simplest way to > >>>>>>>>>>>>>>>>>> achieve our goal. To me the goal is the following: > >>>>>>>>>>>>>>>>>> 1. Clear connector conventions and interfaces. > >>>>>>>>>>>>>>>>>> 2. The easiness of creating a connector. > >>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>> Both of them are important to the prosperity of the > >>>>>>>>>>>>>>>>> connector ecosystem. So > >>>>>>>>>>>>>>>>>> I'd rather abstract as much as possible on our side to > >> make > >>>>>>>>>>>>>>>>> the connector > >>>>>>>>>>>>>>>>>> developer's work lighter. Given this goal, a static util > >>>>>>>>>>>>>>>>> method approach > >>>>>>>>>>>>>>>>>> might have a few drawbacks: > >>>>>>>>>>>>>>>>>> 1. Users still have to construct the metrics by > >> themselves. > >>>>>>>>>>>>>>>>> (And note that > >>>>>>>>>>>>>>>>>> this might be erroneous by itself. For example, a > customer > >>>>>>>>>>>>>>>>> wrapper around > >>>>>>>>>>>>>>>>>> dropwizard meter maybe used instead of MeterView). > >>>>>>>>>>>>>>>>>> 2. When connector specific metrics are added, it is > >>>>>>>>>>>>>>>>> difficult to enforce > >>>>>>>>>>>>>>>>>> the scope to be the same as standard metrics. > >>>>>>>>>>>>>>>>>> 3. It seems that a method proliferation is inevitable if > >> we > >>>>>>>>>>>>>>>>> want to apply > >>>>>>>>>>>>>>>>>> sanity checks. e.g. The metric of numBytesIn was not > >>>>>>>>>>>>>>>>> registered for a meter. > >>>>>>>>>>>>>>>>>> 4. 
Metrics are still defined in random places and hard > to > >>>>>>>>>>>>>>>> track. > >>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>> The current PR I had was inspired by the Config system > in > >>>>>>>>>>>>>>>>> Kafka, which I > >>>>>>>>>>>>>>>>>> found pretty handy. In fact it is not only used by Kafka > >>>>>>>>>>>>>>>>> itself but even > >>>>>>>>>>>>>>>>>> some other projects that depend on Kafka. I am not > saying > >>>>>>>>>>>>>>>>> this approach is > >>>>>>>>>>>>>>>>>> perfect. But I think it worths to save the work for > >>>>>>>>>>>>>>>>> connector writers and > >>>>>>>>>>>>>>>>>> encourage more systematic implementation. That being > said, > >>>>>>>>>>>>>> I > >>>>>>>>>>>>>>>>> am fully open > >>>>>>>>>>>>>>>>>> to suggestions. > >>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>> Re: Histogram > >>>>>>>>>>>>>>>>>> I think there are two orthogonal questions around those > >>>>>>>>>>>>>>>> metrics: > >>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>> 1. Regardless of the metric type, by just looking at the > >>>>>>>>>>>>>>>>> meaning of a > >>>>>>>>>>>>>>>>>> metric, is generic to all connectors? If the answer is > >> yes, > >>>>>>>>>>>>>>>>> we should > >>>>>>>>>>>>>>>>>> include the metric into the convention. No matter > whether > >>>>>>>>>>>>>> we > >>>>>>>>>>>>>>>>> include it > >>>>>>>>>>>>>>>>>> into the convention or not, some connector > implementations > >>>>>>>>>>>>>>>>> will emit such > >>>>>>>>>>>>>>>>>> metric. It is better to have a convention than letting > >> each > >>>>>>>>>>>>>>>>> connector do > >>>>>>>>>>>>>>>>>> random things. > >>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>> 2. If a standard metric is a histogram, what should we > do? > >>>>>>>>>>>>>>>>>> I agree that we should make it clear that using > histograms > >>>>>>>>>>>>>>>>> will have > >>>>>>>>>>>>>>>>>> performance risk. But I do see histogram is useful in > some > >>>>>>>>>>>>>>>>> fine-granularity > >>>>>>>>>>>>>>>>>> debugging where one do not have the luxury to stop the > >>>>>>>>>>>>>>>>> system and inject > >>>>>>>>>>>>>>>>>> more inspection code. So the workaround I am thinking is > >> to > >>>>>>>>>>>>>>>>> provide some > >>>>>>>>>>>>>>>>>> implementation suggestions. Assume later on we have a > >>>>>>>>>>>>>>>>> mechanism of > >>>>>>>>>>>>>>>>>> selective metrics. In the abstract metrics class we can > >>>>>>>>>>>>>>>>> disable those > >>>>>>>>>>>>>>>>>> metrics by default individual connector writers does not > >>>>>>>>>>>>>>>>> have to do > >>>>>>>>>>>>>>>>>> anything (this is another advantage of having an > >>>>>>>>>>>>>>>>> AbstractMetrics instead of > >>>>>>>>>>>>>>>>>> static util methods.) > >>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>> I am not sure I fully understand the histogram in the > >>>>>>>>>>>>>>>>> backend approach. Can > >>>>>>>>>>>>>>>>>> you explain a bit more? Do you mean emitting the raw > data, > >>>>>>>>>>>>>>>>> e.g. fetchTime > >>>>>>>>>>>>>>>>>> and emitTime with each record and let the histogram > >>>>>>>>>>>>>>>>> computation happen in > >>>>>>>>>>>>>>>>>> the background? Or let the processing thread putting the > >>>>>>>>>>>>>>>>> values into a > >>>>>>>>>>>>>>>>>> queue and have a separate thread polling from the queue > >> and > >>>>>>>>>>>>>>>>> add them into > >>>>>>>>>>>>>>>>>> the histogram? 
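Just to illustrate the second option described in the question above (the processing thread enqueues samples, a background thread folds them into the histogram), a rough sketch; the class and all of its names are invented, and the concrete Histogram implementation is left to the caller.

```java
import org.apache.flink.metrics.Histogram;

import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;

// Sketch only: keeps the comparatively expensive Histogram#update off the record path.
public class OffThreadHistogram implements AutoCloseable {

    private final BlockingQueue<Long> samples = new ArrayBlockingQueue<>(4096);
    private final Thread drainer;
    private volatile boolean running = true;

    public OffThreadHistogram(Histogram histogram) {
        this.drainer = new Thread(() -> {
            try {
                while (running) {
                    Long value = samples.poll(100, TimeUnit.MILLISECONDS);
                    if (value != null) {
                        histogram.update(value);
                    }
                }
            } catch (InterruptedException ignored) {
                // shutting down
            }
        }, "histogram-drainer");
        this.drainer.setDaemon(true);
        this.drainer.start();
    }

    /** Hot path: non-blocking; silently drops the sample if the queue is full. */
    public void record(long value) {
        samples.offer(value);
    }

    @Override
    public void close() throws InterruptedException {
        running = false;
        drainer.interrupt();
        drainer.join();
    }
}
```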
> >>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>> Thanks, > >>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>> Jiangjie (Becket) Qin > >>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>> On Fri, Feb 22, 2019 at 4:34 PM Chesnay Schepler > >>>>>>>>>>>>>>>>> <ches...@apache.org <mailto:ches...@apache.org>> > wrote: > >>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>> Re: Flip > >>>>>>>>>>>>>>>>>>> The very first line under both the main header and > >> Purpose > >>>>>>>>>>>>>>>>> section > >>>>>>>>>>>>>>>>>>> describe Flips as "major changes", which this isn't. > >>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>> Re: complication > >>>>>>>>>>>>>>>>>>> I'm not arguing against standardization, but again an > >>>>>>>>>>>>>>>>> over-complicated > >>>>>>>>>>>>>>>>>>> implementation when a static utility method would be > >>>>>>>>>>>>>>>>> sufficient. > >>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>> public static void setupConnectorMetrics( > >>>>>>>>>>>>>>>>>>> MetricGroup operatorMetricGroup, > >>>>>>>>>>>>>>>>>>> String connectorName, > >>>>>>>>>>>>>>>>>>> Optional<Gauge<Long>> numRecordsIn, > >>>>>>>>>>>>>>>>>>> ...) > >>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>> This gives you all you need: > >>>>>>>>>>>>>>>>>>> * a well-defined set of metrics for a connector to > opt-in > >>>>>>>>>>>>>>>>>>> * standardized naming schemes for scope and individual > >>>>>>>>>>>>>>> metrics > >>>>>>>>>>>>>>>>>>> * standardize metric types (although personally I'm not > >>>>>>>>>>>>>>>>> interested in that > >>>>>>>>>>>>>>>>>>> since metric types should be considered syntactic > sugar) > >>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>> Re: Configurable Histogram > >>>>>>>>>>>>>>>>>>> If anything they _must_ be turned off by default, but > the > >>>>>>>>>>>>>>>>> metric system is > >>>>>>>>>>>>>>>>>>> already exposing so many options that I'm not too keen > on > >>>>>>>>>>>>>>>>> adding even more. > >>>>>>>>>>>>>>>>>>> You have also only addressed my first argument against > >>>>>>>>>>>>>>>>> histograms > >>>>>>>>>>>>>>>>>>> (performance), the second one still stands (calculate > >>>>>>>>>>>>>>>>> histogram in metric > >>>>>>>>>>>>>>>>>>> backends instead). > >>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>> On 21.02.2019 16:27, Becket Qin wrote: > >>>>>>>>>>>>>>>>>>>> Hi Chesnay, > >>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>> Thanks for the comments. I think this is worthy of a > >> FLIP > >>>>>>>>>>>>>>>>> because it is > >>>>>>>>>>>>>>>>>>>> public API. According to the FLIP description a FlIP > is > >>>>>>>>>>>>>>>>> required in case > >>>>>>>>>>>>>>>>>>> of: > >>>>>>>>>>>>>>>>>>>> - Any change that impacts the public interfaces of > >>>>>>>>>>>>>>>>> the project > >>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>> and the following entry is found in the definition of > >>>>>>>>>>>>>>>>> "public interface". > >>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>> - Exposed monitoring information > >>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>> Metrics are critical to any production system. So a > >> clear > >>>>>>>>>>>>>>>>> metric > >>>>>>>>>>>>>>>>>>> definition > >>>>>>>>>>>>>>>>>>>> is important for any serious users. For an > organization > >>>>>>>>>>>>>>>>> with large Flink > >>>>>>>>>>>>>>>>>>>> installation, change in metrics means great amount of > >>>>>>>>>>>>>>>>> work. So such > >>>>>>>>>>>>>>>>>>> changes > >>>>>>>>>>>>>>>>>>>> do need to be fully discussed and documented. > >>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>> ** Re: Histogram. 
> >>>>>>>>>>>>>>>>>>>> We can discuss whether there is a better way to expose > >>>>>>>>>>>>>>>>> metrics that are > >>>>>>>>>>>>>>>>>>>> suitable for histograms. My micro-benchmark on various > >>>>>>>>>>>>>>>>> histogram > >>>>>>>>>>>>>>>>>>>> implementations also indicates that they are > >>>>>>>>>>>>>> significantly > >>>>>>>>>>>>>>>>> slower than > >>>>>>>>>>>>>>>>>>>> other metric types. But I don't think that means never > >>>>>>>>>>>>>> use > >>>>>>>>>>>>>>>>> histogram, but > >>>>>>>>>>>>>>>>>>>> means use it with caution. For example, we can suggest > >>>>>>>>>>>>>> the > >>>>>>>>>>>>>>>>>>> implementations > >>>>>>>>>>>>>>>>>>>> to turn them off by default and only turn it on for a > >>>>>>>>>>>>>>>>> small amount of > >>>>>>>>>>>>>>>>>>> time > >>>>>>>>>>>>>>>>>>>> when performing some micro-debugging. > >>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>> ** Re: complication: > >>>>>>>>>>>>>>>>>>>> Connector conventions are essential for Flink > ecosystem. > >>>>>>>>>>>>>>>>> Flink connectors > >>>>>>>>>>>>>>>>>>>> pool is probably the most important part of Flink, > just > >>>>>>>>>>>>>>>>> like any other > >>>>>>>>>>>>>>>>>>> data > >>>>>>>>>>>>>>>>>>>> system. Clear conventions of connectors will help > build > >>>>>>>>>>>>>>>>> Flink ecosystem > >>>>>>>>>>>>>>>>>>> in > >>>>>>>>>>>>>>>>>>>> a more organic way. > >>>>>>>>>>>>>>>>>>>> Take the metrics convention as an example, imagine > >>>>>>>>>>>>>> someone > >>>>>>>>>>>>>>>>> has developed > >>>>>>>>>>>>>>>>>>> a > >>>>>>>>>>>>>>>>>>>> Flink connector for System foo, and another developer > >> may > >>>>>>>>>>>>>>>>> have developed > >>>>>>>>>>>>>>>>>>> a > >>>>>>>>>>>>>>>>>>>> monitoring and diagnostic framework for Flink which > >>>>>>>>>>>>>>>>> analyzes the Flink > >>>>>>>>>>>>>>>>>>> job > >>>>>>>>>>>>>>>>>>>> performance based on metrics. With a clear metric > >>>>>>>>>>>>>>>>> convention, those two > >>>>>>>>>>>>>>>>>>>> projects could be developed independently. Once users > >> put > >>>>>>>>>>>>>>>>> them together, > >>>>>>>>>>>>>>>>>>>> it would work without additional modifications. This > >>>>>>>>>>>>>>>>> cannot be easily > >>>>>>>>>>>>>>>>>>>> achieved by just defining a few constants. > >>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>> ** Re: selective metrics: > >>>>>>>>>>>>>>>>>>>> Sure, we can discuss that in a separate thread. > >>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>> @Dawid > >>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>> ** Re: latency / fetchedLatency > >>>>>>>>>>>>>>>>>>>> The primary purpose of establish such a convention is > to > >>>>>>>>>>>>>>>>> help developers > >>>>>>>>>>>>>>>>>>>> write connectors in a more compatible way. The > >> convention > >>>>>>>>>>>>>>>>> is supposed to > >>>>>>>>>>>>>>>>>>> be > >>>>>>>>>>>>>>>>>>>> defined more proactively. So when look at the > >> convention, > >>>>>>>>>>>>>>>>> it seems more > >>>>>>>>>>>>>>>>>>>> important to see if the concept is applicable to > >>>>>>>>>>>>>>>>> connectors in general. > >>>>>>>>>>>>>>>>>>> It > >>>>>>>>>>>>>>>>>>>> might be true so far only Kafka connector reports > >>>>>>>>>>>>>> latency. > >>>>>>>>>>>>>>>>> But there > >>>>>>>>>>>>>>>>>>> might > >>>>>>>>>>>>>>>>>>>> be hundreds of other connector implementations in the > >>>>>>>>>>>>>>>>> Flink ecosystem, > >>>>>>>>>>>>>>>>>>>> though not in the Flink repo, and some of them also > >> emits > >>>>>>>>>>>>>>>>> latency. I > >>>>>>>>>>>>>>>>>>> think > >>>>>>>>>>>>>>>>>>>> a lot of other sources actually also has an append > >>>>>>>>>>>>>>>>> timestamp. e.g. 
> >> On Thu, Feb 21, 2019 at 10:14 PM Chesnay Schepler <ches...@apache.org> wrote:
> >>
> >>> Regarding 2) It doesn't make sense to investigate this as part of this
> >>> FLIP. This is something that could be of interest for the entire metric
> >>> system, and should be designed for as such.
> >>>
> >>> Regarding the proposal as a whole:
> >>>
> >>> Histogram metrics shall not be added to the core of Flink. They are
> >>> significantly more expensive than other metrics, and calculating
> >>> histograms in the application is regarded as an anti-pattern by several
> >>> metric backends, which instead recommend exposing the raw data and
> >>> calculating the histogram in the backend.
> >>>
> >>> Second, this seems overly complicated. Given that we have already
> >>> established that not all connectors will export all metrics, we are
> >>> effectively reducing this down to a consistent naming scheme. We don't
> >>> need anything sophisticated for that; basically just a few constants that
> >>> all connectors use.
> >>>
> >>> I'm not convinced that this is worthy of a FLIP.
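[For illustration: the "few constants" alternative could look roughly like the snippet below. The class is hypothetical and the names are only those mentioned in this thread; a connector would register just the ones it can actually provide.]

/** Hypothetical constants holder; not an existing Flink class. */
public final class ConnectorMetricNames {

    /** Scope under which a connector registers its metrics. */
    public static final String CONNECTOR_SCOPE = "connector";

    // Metric names drawn from this discussion.
    public static final String NUM_RECORDS_IN = "numRecordsIn";
    public static final String FETCH_LATENCY = "fetchLatency";
    public static final String LATENCY = "latency";

    private ConnectorMetricNames() {}
}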
> >>> On 21.02.2019 14:26, Dawid Wysakowicz wrote:
> >>>> Hi,
> >>>>
> >>>> Ad 1. In general I understand and I agree. But those particular metrics
> >>>> (latency, fetchLatency) right now would only be reported if the user
> >>>> uses the KafkaConsumer with the internal timestampAssigner and with
> >>>> StreamCharacteristic set to EventTime, right? That sounds like a very
> >>>> specific case. I am not sure if we should introduce a generic metric
> >>>> that will be disabled/absent for most implementations.
> >>>>
> >>>> Ad 2. That sounds like an orthogonal issue that might make sense to
> >>>> investigate in the future.
> >>>>
> >>>> Best,
> >>>>
> >>>> Dawid
> >>>>
> >>>> On 21/02/2019 13:20, Becket Qin wrote:
> >>>>> Hi Dawid,
> >>>>>
> >>>>> Thanks for the feedback. That makes sense to me. There are two cases to
> >>>>> be addressed.
> >>>>>
> >>>>> 1. The metrics are supposed to be a guidance. It is likely that a
> >>>>> connector only supports some but not all of the metrics. In that case,
> >>>>> each connector implementation should have the freedom to decide which
> >>>>> metrics are reported. For the metrics that are supported, the guidance
> >>>>> should be followed.
> >>>>>
> >>>>> 2. Sometimes users may want to disable certain metrics for some reason
> >>>>> (e.g. performance / reprocessing of data). A generic mechanism should be
> >>>>> provided to allow users to choose which metrics are reported. This
> >>>>> mechanism should also be honored by the connector implementations.
> >>>>>
> >>>>> Does this sound reasonable to you?
> >>>>>
> >>>>> Thanks,
> >>>>>
> >>>>> Jiangjie (Becket) Qin
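[For illustration: one conceivable shape for the generic "choose which metrics are reported" mechanism from point 2 above. The configuration value and the whole mechanism are hypothetical; nothing like this exists in Flink today.]

import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

/** Hypothetical switch consulted by connectors before registering a metric. */
class ConnectorMetricFilter {

    private final Set<String> disabled;

    /** @param disabledList user-supplied, comma-separated, e.g. "latency,fetchLatency" */
    ConnectorMetricFilter(String disabledList) {
        this.disabled = new HashSet<>(Arrays.asList(disabledList.split(",")));
    }

    boolean shouldReport(String metricName) {
        return !disabled.contains(metricName);
    }
}

A connector implementation would then call shouldReport(...) before registering each optional metric, so the same switch works across all connectors that follow the convention.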
> >>>>> On Thu, Feb 21, 2019 at 4:22 PM Dawid Wysakowicz <dwysakow...@apache.org> wrote:
> >>>>>
> >>>>>> Hi,
> >>>>>>
> >>>>>> Generally I like the idea of having a unified, standard set of metrics
> >>>>>> for all connectors. I have some slight concerns about fetchLatency and
> >>>>>> latency though. They are computed based on EventTime, which is not a
> >>>>>> purely technical feature: it often depends on some business logic, and
> >>>>>> might be absent or defined after the source. Those metrics could also
> >>>>>> behave in a weird way when replaying a backlog. Therefore I am not sure
> >>>>>> if we should include those metrics by default. Maybe we could at least
> >>>>>> introduce a feature switch for them? What do you think?
> >>>>>>
> >>>>>> Best,
> >>>>>>
> >>>>>> Dawid
> >>>>>>
> >>>>>> On 21/02/2019 03:13, Becket Qin wrote:
> >>>>>>
> >>>>>>> Bump. If there are no objections to the proposed metrics, I'll start a
> >>>>>>> voting thread later today.
> >>>>>>>
> >>>>>>> Thanks,
> >>>>>>>
> >>>>>>> Jiangjie (Becket) Qin
> >>>>>>>
> >>>>>>> On Mon, Feb 11, 2019 at 8:17 PM Becket Qin <becket....@gmail.com> wrote:
> >>>>>>>
> >>>>>>>> Hi folks,
> >>>>>>>>
> >>>>>>>> I would like to start the FLIP discussion thread about standardizing
> >>>>>>>> the connector metrics.
> >>>>>>>>
> >>>>>>>> In short, we would like to provide a convention for Flink connector
> >>>>>>>> metrics. It will help simplify the monitoring of and alerting on
> >>>>>>>> Flink jobs. The FLIP link is the following:
> >>>>>>>>
> >>>>>>>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-33%3A+Standardize+Connector+Metrics
> >>>>>>>>
> >>>>>>>> Thanks,
> >>>>>>>>
> >>>>>>>> Jiangjie (Becket) Qin
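[For illustration: the fetchLatency / latency metrics debated above are event-time based. Assuming, as a reading of this thread rather than the FLIP's wording, that fetchLatency is the fetch time minus the record's event timestamp and latency is the emit time minus that timestamp, a connector-side sketch could look like the following; it also shows why the values spike when a backlog is replayed. Class and method names are hypothetical; Gauge and MetricGroup are the real Flink types.]

import org.apache.flink.metrics.Gauge;
import org.apache.flink.metrics.MetricGroup;

/** Illustrative only; the metric definitions are assumptions for this sketch. */
class EventTimeLatencyGauges {

    private volatile long lastFetchLatencyMillis;
    private volatile long lastEmitLatencyMillis;

    EventTimeLatencyGauges(MetricGroup connectorGroup) {
        connectorGroup.gauge("fetchLatency", (Gauge<Long>) () -> lastFetchLatencyMillis);
        connectorGroup.gauge("latency", (Gauge<Long>) () -> lastEmitLatencyMillis);
    }

    // Called when a record is fetched from the external system.
    void onRecordFetched(long eventTimestampMillis) {
        lastFetchLatencyMillis = System.currentTimeMillis() - eventTimestampMillis;
    }

    // Called when the record is emitted downstream.
    void onRecordEmitted(long eventTimestampMillis) {
        lastEmitLatencyMillis = System.currentTimeMillis() - eventTimestampMillis;
    }
}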