+1 from my side

On Fri, Sep 18, 2020 at 4:54 PM Stephan Ewen <step...@ververica.com> wrote:

> Having the watermark lag metric was the important part from my side.
>
> So +1 to go ahead.
>
> On Fri, Sep 18, 2020 at 4:11 PM Becket Qin <becket....@gmail.com> wrote:
>
> > Hey Stephan,
> >
> > Thanks for the quick reply. I actually forgot to mention that I also added
> > a "watermarkLag" metric to the Source metrics, in addition to
> > "currentFetchEventTimeLag" and "currentEmitEventTimeLag".
> > So in the current proposal, both XXEventTimeLag metrics will be based on
> > event timestamps, while watermarkLag is based on the watermark. All the
> > lags are defined against the wall clock time. Does this cover what you are
> > proposing?
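> >
> > (To make the "against the wall clock time" part concrete, here is a rough
> > sketch of how a source could report watermarkLag. Only the metric name and
> > the MetricGroup / Gauge types are actual Flink API; the class and field
> > names are made up for illustration.)
> >
> > import org.apache.flink.metrics.Gauge;
> > import org.apache.flink.metrics.MetricGroup;
> >
> > class WatermarkLagMetric {
> >     private volatile long lastEmittedWatermark = Long.MIN_VALUE;
> >
> >     WatermarkLagMetric(MetricGroup metricGroup) {
> >         // watermarkLag = wall clock time - latest watermark emitted by this source
> >         metricGroup.gauge("watermarkLag",
> >                 (Gauge<Long>) () -> System.currentTimeMillis() - lastEmittedWatermark);
> >     }
> >
> >     void onWatermarkEmitted(long watermark) {
> >         lastEmittedWatermark = watermark;
> >     }
> > }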
> >
> > Thanks,
> >
> > Jiangjie (Becket) Qin
> >
> > On Fri, Sep 18, 2020 at 6:08 PM Stephan Ewen <step...@ververica.com>
> > wrote:
> >
> >> Hi Becket!
> >>
> >> I am wondering if it makes sense to do the following small change:
> >>
> >>   - Have "currentFetchEventTimeLag" be defined on event timestamps
> >> (optionally, if the source system exposes it)  <== this is like in your
> >> proposal
> >>       this helps understand how long the records were in the source
> before
> >>
> >>   - BUT change "currentEmitEventTimeLag" to be
> >> "currentSourceWatermarkLag" instead.
> >>     That way, users can see how far the source's progress is behind
> >> wall-clock time overall.
> >>     It is also a better-defined metric, as it does not oscillate with
> >> out-of-order events.
> >>
> >> What do you think?
> >>
> >> Best,
> >> Stephan
> >>
> >>
> >>
> >> On Fri, Sep 18, 2020 at 12:02 PM Becket Qin <becket....@gmail.com>
> wrote:
> >>
> >>> Hi folks,
> >>>
> >>> Thanks for all the great feedback. I have just updated the FLIP-33 wiki
> >>> with the following changes:
> >>>
> >>> 1. Renamed "currentFetchLatency" to "currentFetchEventTimeLag" and
> >>> "currentLatency" to "currentEmitEventTimeLag".
> >>> 2. Added the public interface code change required for the new metrics.
> >>> 3. Added description of whether a metric is predefined or optional, and
> >>> which component is expected to update the metric.
> >>>
> >>> Please let me know if you have any questions. I'll start a vote in two
> >>> days if there are no further concerns.
> >>>
> >>> Thanks,
> >>>
> >>> Jiangjie (Becket) Qin
> >>>
> >>> On Wed, Sep 9, 2020 at 9:56 AM Becket Qin <becket....@gmail.com>
> wrote:
> >>>
> >>>> Hi Stephan,
> >>>>
> >>>> Thanks for the input. Just a few more clarifications / questions.
> >>>>
> >>>> *Num Bytes / Records Metrics*
> >>>>
> >>>> 1. At this point, the *numRecordsIn(Rate)* metrics exist in both
> >>>> OperatorIOMetricGroup and TaskIOMetricGroup. I could not find any place
> >>>> other than the unit tests where *numRecordsIn(Rate)* in the
> >>>> TaskIOMetricGroup is updated. Am I missing something?
> >>>>
> >>>> 2. *numBytesIn(Rate)* metrics only exist in TaskIOMetricGroup. At this
> >>>> point, a SourceReader only has access to a SourceReaderContext which
> >>>> provides an OperatorMetricGroup. So it seems that the connector
> >>>> developers are not able to update *numBytesIn(Rate)*. With the multiple
> >>>> Source chaining support, it is possible that multiple Sources are in the
> >>>> same task. So it looks like we need to add *numBytesIn(Rate)* to the
> >>>> operator metrics as well.
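> >>>>
> >>>> (For illustration, roughly what the connector code could then do. The
> >>>> metric names follow the FLIP; where the operator MetricGroup comes from
> >>>> is left open here, and the class/method names are made up.)
> >>>>
> >>>> import org.apache.flink.metrics.Counter;
> >>>> import org.apache.flink.metrics.MeterView;
> >>>> import org.apache.flink.metrics.MetricGroup;
> >>>>
> >>>> class ByteMetrics {
> >>>>     static Counter registerNumBytesIn(MetricGroup operatorGroup) {
> >>>>         Counter numBytesIn = operatorGroup.counter("numBytesIn");
> >>>>         // the rate metric can be derived from the counter (60s time span)
> >>>>         operatorGroup.meter("numBytesInPerSecond", new MeterView(numBytesIn, 60));
> >>>>         return numBytesIn;
> >>>>     }
> >>>> }
> >>>> // in the SplitReader / record emitter, per fetched record or batch:
> >>>> // numBytesIn.inc(bytesReceived);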
> >>>>
> >>>>
> >>>> *Current (Fetch) Latency*
> >>>>
> >>>> *currentFetchLatency* helps clearly tell whether the latency is caused
> >>>> by Flink or not. Backpressure is not the only reason that we see fetch
> >>>> latency. Even if there is no backpressure, the records may have passed
> >>>> a long pipeline before they entered Flink. For example, say the
> >>>> *currentLatency* is 10 seconds and there is no backpressure. Does that
> >>>> mean the record spent 10 seconds in the Source operator? If not, how
> >>>> much did Flink contribute to that 10 seconds of latency? These questions
> >>>> are frequently asked and hard to answer without the fetch latency.
> >>>>
> >>>> For "currentFetchLatency", we would need to understand timestamps
> >>>>> before the records are decoded. That is only possible for some
> sources,
> >>>>> where the client gives us the records in a (partially) decoded form
> already
> >>>>> (like Kafka). Then, some work has been done between the fetch time
> and the
> >>>>> time we update the metric already, so it is already a bit closer to
> the
> >>>>> "currentFetchLatency". I think following this train of thought,
> there is
> >>>>> diminished benefit from that specific metric.
> >>>>
> >>>>
> >>>> We may not have to report the fetch latency before records are
> decoded.
> >>>> One solution is to remember the* FetchTime* when the encoded records
> >>>> are fetched, and report the fetch latency after the records are
> decoded by
> >>>> computing (*FetchTime - EventTime*). An approximate implementation
> >>>> would be adding a *FetchTime *field to the *RecordsWithSplitIds*
> assuming
> >>>> that all the records in that data structure are fetched at the same
> time.
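> >>>>
> >>>> (Roughly like this; the class and method names below are made up just to
> >>>> show where the two timestamps come from, not a concrete API proposal.)
> >>>>
> >>>> // sketch: one fetch timestamp per batch of encoded records
> >>>> class FetchLagTracker {
> >>>>     volatile long lastFetchEventTimeLag;
> >>>>
> >>>>     // called by the SplitReader when an encoded batch arrives
> >>>>     long markFetchTime() {
> >>>>         return System.currentTimeMillis();
> >>>>     }
> >>>>
> >>>>     // called when a decoded record from that batch is emitted
> >>>>     void onRecordEmitted(long fetchTimeMillis, long eventTimestamp) {
> >>>>         lastFetchEventTimeLag = fetchTimeMillis - eventTimestamp;
> >>>>     }
> >>>> }
> >>>> // a Gauge registered as "currentFetchEventTimeLag" would simply return
> >>>> // lastFetchEventTimeLag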
> >>>>
> >>>> Thoughts?
> >>>>
> >>>> Thanks,
> >>>>
> >>>> Jiangjie (Becket) Qin
> >>>>
> >>>> On Wed, Sep 9, 2020 at 12:42 AM Stephan Ewen <step...@ververica.com>
> >>>> wrote:
> >>>>
> >>>>> Thanks for reviving this, Becket!
> >>>>>
> >>>>> I think Konstantin's comments are great. I'd add these points:
> >>>>>
> >>>>> *Num Bytes / Records Metrics*
> >>>>>
> >>>>> For "numBytesIn" and "numRecordsIn", we should reuse the
> >>>>> OperatorIOMetric group, then it also gets reported to the overview
> page in
> >>>>> the Web UI.
> >>>>>
> >>>>> The "numBytesInPerSecond" and "numRecordsInPerSecond" metrics are
> >>>>> derived automatically, so there is no need to do anything once we
> >>>>> populate the above two metrics.
> >>>>>
> >>>>>
> >>>>> *Current (Fetch) Latency*
> >>>>>
> >>>>> I would really go for "eventTimeLag" rather than "fetchLatency". I
> >>>>> think "eventTimeLag" is a term that has some adoption in the Flink
> >>>>> community and beyond.
> >>>>>
> >>>>> I am not so sure that I see the benefit of distinguishing between
> >>>>> "currentLatency" and "currentFetchLatency" (or event time lag
> >>>>> before/after), as the two only differ by the time it takes to emit a
> >>>>> batch.
> >>>>>      - In a non-backpressured case, these should be virtually
> >>>>> identical (and both dominated by watermark lag, not the actual time
> it
> >>>>> takes the fetch to be emitted)
> >>>>>      - In a backpressured case, why do you care about when data was
> >>>>> fetched, as opposed to emitted? Emitted time is relevant for
> application
> >>>>> semantics and checkpoints. Fetch time seems to be an implementation
> detail
> >>>>> (how much does the source buffer).
> >>>>>
> >>>>> The "currentLatency" (eventTimeLagAfter) can be computed
> >>>>> out-of-the-box, independent of a source implementation, so that is
> also a
> >>>>> good argument to make it the main metric.
> >>>>> We know the timestamps and watermarks in the source, except for cases
> >>>>> where no watermarks have been defined at all (batch jobs or pure
> >>>>> processing-time jobs), in which case this metric should probably be
> >>>>> "Infinite".
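> >>>>>
> >>>>> (As a sketch, the generic per-record computation could be as simple as
> >>>>> the following; the class, constant, and method names are made up.)
> >>>>>
> >>>>> class EventTimeLagUtil {
> >>>>>     static final long NO_TIMESTAMP = Long.MIN_VALUE;
> >>>>>
> >>>>>     // value exposed by the "currentLatency" / eventTimeLag gauge
> >>>>>     static long eventTimeLag(long lastEmittedEventTimestamp) {
> >>>>>         return lastEmittedEventTimestamp == NO_TIMESTAMP
> >>>>>                 ? Long.MAX_VALUE // "Infinite" if no event time / watermarks defined
> >>>>>                 : System.currentTimeMillis() - lastEmittedEventTimestamp;
> >>>>>     }
> >>>>> }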
> >>>>>
> >>>>> For "currentFetchLatency", we would need to understand timestamps
> >>>>> before the records are decoded. That is only possible for some
> sources,
> >>>>> where the client gives us the records in a (partially) decoded form
> already
> >>>>> (like Kafka). Then, some work has been done between the fetch time
> and the
> >>>>> time we update the metric already, so it is already a bit closer to
> the
> >>>>> "currentFetchLatency". I think following this train of thought,
> there is
> >>>>> diminished benefit from that specific metric.
> >>>>>
> >>>>>
> >>>>> *Idle Time*
> >>>>>
> >>>>> I agree, it would be great to rename this. Maybe to "sourceWaitTime" or
> >>>>> "sourceIdleTime", to make clear that this is not exactly the time that
> >>>>> Flink's processing pipeline is idle, but the time during which the
> >>>>> source does not have new data.
> >>>>>
> >>>>> This is not an easy metric to collect, though (except maybe for the
> >>>>> sources that are only idle while they have no split assigned, like
> >>>>> continuous file source).
> >>>>>
> >>>>> *Source Specific Metrics*
> >>>>>
> >>>>> I believe source-specific would only be "sourceIdleTime",
> >>>>> "numRecordsInErrors", "pendingBytes", and "pendingRecords".
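> >>>>>
> >>>>> (For example, a source that knows its backlog could expose the pending
> >>>>> metrics roughly like this; the suppliers are whatever the connector has
> >>>>> internally, e.g. consumer lag in Kafka or unread files/bytes in a file
> >>>>> source. The class and method names are made up.)
> >>>>>
> >>>>> import java.util.function.LongSupplier;
> >>>>> import org.apache.flink.metrics.Gauge;
> >>>>> import org.apache.flink.metrics.MetricGroup;
> >>>>>
> >>>>> class PendingMetrics {
> >>>>>     static void register(MetricGroup group,
> >>>>>             LongSupplier pendingRecords, LongSupplier pendingBytes) {
> >>>>>         group.gauge("pendingRecords", (Gauge<Long>) pendingRecords::getAsLong);
> >>>>>         group.gauge("pendingBytes", (Gauge<Long>) pendingBytes::getAsLong);
> >>>>>     }
> >>>>> }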
> >>>>>
> >>>>>
> >>>>> *Conclusion*
> >>>>>
> >>>>> We can probably add "numBytesIn", "numRecordsIn", and "eventTimeLag"
> >>>>> with little complexity.
> >>>>> I'd suggest starting with these right away.
> >>>>>
> >>>>> Best,
> >>>>> Stephan
> >>>>>
> >>>>>
> >>>>> On Tue, Sep 8, 2020 at 3:25 PM Becket Qin <becket....@gmail.com>
> >>>>> wrote:
> >>>>>
> >>>>>> Hey Konstantin,
> >>>>>>
> >>>>>> Thanks for the feedback and suggestions. Please see the reply below.
> >>>>>>
> >>>>>> * idleTime: In the meantime, a similar metric "idleTimeMsPerSecond"
> >>>>>>> has
> >>>>>>> been introduced in
> https://issues.apache.org/jira/browse/FLINK-16864.
> >>>>>>> They
> >>>>>>> have a similar name, but different definitions of idleness,
> >>>>>>> e.g. "idleTimeMsPerSecond" considers the SourceTask idle, when it
> is
> >>>>>>> backpressured. Can we make it clearer that these two metrics mean
> >>>>>>> different
> >>>>>>> things?
> >>>>>>
> >>>>>>
> >>>>>> That is a good point. I did not notice this metric earlier. It seems
> >>>>>> that both metrics are useful to the users. One tells them how busy the
> >>>>>> source is and how much more throughput the source can handle. The other
> >>>>>> tells the users how long it has been since the source last saw a
> >>>>>> record, which is useful for debugging. I'll update the FLIP to make it
> >>>>>> clear.
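> >>>>>>
> >>>>>> (To make the difference concrete, the FLIP's notion could be tracked
> >>>>>> roughly like this; the names are only for illustration.)
> >>>>>>
> >>>>>> // "how long since the source last saw a record", as opposed to
> >>>>>> // idleTimeMsPerSecond, which measures how busy the source task is
> >>>>>> class SourceIdleTimeTracker {
> >>>>>>     private volatile long lastRecordSeenTime = System.currentTimeMillis();
> >>>>>>
> >>>>>>     void onRecordReceived() {
> >>>>>>         lastRecordSeenTime = System.currentTimeMillis();
> >>>>>>     }
> >>>>>>
> >>>>>>     long idleTimeMillis() { // exposed through a Gauge
> >>>>>>         return System.currentTimeMillis() - lastRecordSeenTime;
> >>>>>>     }
> >>>>>> }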
> >>>>>>
> >>>>>>   * "current(Fetch)Latency" I am wondering if
> >>>>>>> "eventTimeLag(Before|After)"
> >>>>>>> is more descriptive/clear. What do others think?
> >>>>>>
> >>>>>>
> >>>>>> I am quite open to ideas on these names. In fact, I also feel the
> >>>>>> "current(Fetch)Latency" names are not super intuitive, so it would be
> >>>>>> great if we could come up with better ones.
> >>>>>>
> >>>>>>   * Current(Fetch)Latency implies that the timestamps are directly
> >>>>>>> extracted in the source connector, right? Will this be the default
> >>>>>>> for
> >>>>>>> FLIP-27 sources anyway?
> >>>>>>
> >>>>>>
> >>>>>> The "currentFetchLatency" will probably be reported by each source
> >>>>>> implementation, because the data fetching is done by SplitReaders and
> >>>>>> there is no base implementation. The "currentLatency", on the other
> >>>>>> hand, can be reported by the SourceReader base implementation. However,
> >>>>>> since developers can implement their own source connector without using
> >>>>>> our base implementation, this metric guidance is still useful for those
> >>>>>> connector developers.
> >>>>>>
> >>>>>> * Does FLIP-33 also include the implementation of these metrics (to
> >>>>>>> the
> >>>>>>> extent possible) for all connectors currently available in Apache
> >>>>>>> Flink or
> >>>>>>> is the "per-connector implementation" out-of-scope?
> >>>>>>
> >>>>>>
> >>>>>> FLIP-33 itself does not specify any implementation of those metrics.
> >>>>>> But the connectors we provide in Apache Flink will follow the
> guidance of
> >>>>>> FLIP-33 to emit those metrics when applicable. Maybe we can have
> some
> >>>>>> public static Strings defined for the metric names to help other
> connector
> >>>>>> developers follow the same guidance. I can also add that to the
> public
> >>>>>> interface section of the FLIP if we decide to do that.
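> >>>>>>
> >>>>>> (Something along these lines; the class name is a placeholder and the
> >>>>>> final list of constants would of course come from the FLIP.)
> >>>>>>
> >>>>>> public final class ConnectorMetricNames {
> >>>>>>     public static final String NUM_RECORDS_IN = "numRecordsIn";
> >>>>>>     public static final String NUM_BYTES_IN = "numBytesIn";
> >>>>>>     public static final String CURRENT_FETCH_EVENT_TIME_LAG = "currentFetchEventTimeLag";
> >>>>>>     public static final String CURRENT_EMIT_EVENT_TIME_LAG = "currentEmitEventTimeLag";
> >>>>>>     public static final String WATERMARK_LAG = "watermarkLag";
> >>>>>>     public static final String PENDING_RECORDS = "pendingRecords";
> >>>>>>     public static final String PENDING_BYTES = "pendingBytes";
> >>>>>>
> >>>>>>     private ConnectorMetricNames() {}
> >>>>>> }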
> >>>>>>
> >>>>>> Thanks,
> >>>>>>
> >>>>>> Jiangjie (Becket) Qin
> >>>>>>
> >>>>>> On Tue, Sep 8, 2020 at 9:02 PM Becket Qin <becket....@gmail.com>
> >>>>>> wrote:
> >>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> On Tue, Sep 8, 2020 at 6:55 PM Konstantin Knauf <kna...@apache.org
> >
> >>>>>>> wrote:
> >>>>>>>
> >>>>>>>> Hi Becket,
> >>>>>>>>
> >>>>>>>> Thank you for picking up this FLIP. I have a few questions:
> >>>>>>>>
> >>>>>>>> * two thoughts on naming:
> >>>>>>>>    * idleTime: In the meantime, a similar metric
> >>>>>>>> "idleTimeMsPerSecond" has
> >>>>>>>> been introduced in
> >>>>>>>> https://issues.apache.org/jira/browse/FLINK-16864. They
> >>>>>>>> have a similar name, but different definitions of idleness,
> >>>>>>>> e.g. "idleTimeMsPerSecond" considers the SourceTask idle, when it
> is
> >>>>>>>> backpressured. Can we make it clearer that these two metrics mean
> >>>>>>>> different
> >>>>>>>> things?
> >>>>>>>>
> >>>>>>>>   * "current(Fetch)Latency" I am wondering if
> >>>>>>>> "eventTimeLag(Before|After)"
> >>>>>>>> is more descriptive/clear. What do others think?
> >>>>>>>>
> >>>>>>>>   * Current(Fetch)Latency implies that the timestamps are directly
> >>>>>>>> extracted in the source connector, right? Will this be the default
> >>>>>>>> for
> >>>>>>>> FLIP-27 sources anyway?
> >>>>>>>>
> >>>>>>>> * Does FLIP-33 also include the implementation of these metrics
> (to
> >>>>>>>> the
> >>>>>>>> extent possible) for all connectors currently available in Apache
> >>>>>>>> Flink or
> >>>>>>>> is the "per-connector implementation" out-of-scope?
> >>>>>>>>
> >>>>>>>> Thanks,
> >>>>>>>>
> >>>>>>>> Konstantin
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> On Fri, Sep 4, 2020 at 4:56 PM Becket Qin <becket....@gmail.com>
> >>>>>>>> wrote:
> >>>>>>>>
> >>>>>>>> > Hi all,
> >>>>>>>> >
> >>>>>>>> > To complete the Source refactoring work, I'd like to revive this
> >>>>>>>> > discussion. Since the mail thread has been dormant for more than
> >>>>>>>> a year,
> >>>>>>>> > just to recap the motivation of the FLIP:
> >>>>>>>> >
> >>>>>>>> > 1. The FLIP proposes to standardize the connector metrics by
> >>>>>>>> giving
> >>>>>>>> > guidance on the metric specifications, including the name, type
> >>>>>>>> and meaning
> >>>>>>>> > of the metrics.
> >>>>>>>> > 2. It is OK for a connector to not emit some of the metrics in
> >>>>>>>> the metric
> >>>>>>>> > guidance, but if a metric with the same semantics is emitted, the
> >>>>>>>> > implementation should follow the guidance.
> >>>>>>>> > 3. It is OK for a connector to emit more metrics than what are
> >>>>>>>> listed in
> >>>>>>>> > the FLIP. This includes having an alias for a metric specified
> in
> >>>>>>>> the FLIP.
> >>>>>>>> > 4. We will implement some of the metrics out of the box in the
> >>>>>>>> default
> >>>>>>>> > implementation of FLIP-27, as long as they are applicable.
> >>>>>>>> >
> >>>>>>>> > The FLIP wiki is following:
> >>>>>>>> > https://cwiki.apache.org/confluence/display/FLINK/FLIP-33
> >>>>>>>> > %3A+Standardize+Connector+Metrics
> >>>>>>>> >
> >>>>>>>> > Any thoughts?
> >>>>>>>> >
> >>>>>>>> > Thanks,
> >>>>>>>> >
> >>>>>>>> > Jiangjie (Becket) Qin
> >>>>>>>> >
> >>>>>>>> >
> >>>>>>>> > On Fri, Jun 14, 2019 at 2:29 PM Piotr Nowojski <
> >>>>>>>> pi...@ververica.com>
> >>>>>>>> > wrote:
> >>>>>>>> >
> >>>>>>>> > > > we will need to revisit the convention list and adjust them
> >>>>>>>> accordingly
> >>>>>>>> > > when FLIP-27 is ready
> >>>>>>>> > >
> >>>>>>>> > >
> >>>>>>>> > > Yes, this sounds good :)
> >>>>>>>> > >
> >>>>>>>> > > Piotrek
> >>>>>>>> > >
> >>>>>>>> > > > On 13 Jun 2019, at 13:35, Becket Qin <becket....@gmail.com>
> >>>>>>>> wrote:
> >>>>>>>> > > >
> >>>>>>>> > > > Hi Piotr,
> >>>>>>>> > > >
> >>>>>>>> > > > That's great to know. Chances are that we will need to revisit
> >>>>>>>> > > > the convention list and adjust it accordingly when FLIP-27 is
> >>>>>>>> > > > ready. At that point we can mark some of the metrics as available
> >>>>>>>> > > > by default for connectors implementing the new interface.
> >>>>>>>> > > >
> >>>>>>>> > > > Thanks,
> >>>>>>>> > > >
> >>>>>>>> > > > Jiangjie (Becket) Qin
> >>>>>>>> > > >
> >>>>>>>> > > > On Thu, Jun 13, 2019 at 3:51 PM Piotr Nowojski <
> >>>>>>>> pi...@ververica.com>
> >>>>>>>> > > wrote:
> >>>>>>>> > > >
> >>>>>>>> > > >> Thanks for driving this. I’ve just noticed one small thing.
> >>>>>>>> With new
> >>>>>>>> > > >> SourceReader interface Flink will be able to provide
> >>>>>>>> `idleTime` metric
> >>>>>>>> > > >> automatically.
> >>>>>>>> > > >>
> >>>>>>>> > > >> Piotrek
> >>>>>>>> > > >>
> >>>>>>>> > > >>> On 13 Jun 2019, at 03:30, Becket Qin <
> becket....@gmail.com>
> >>>>>>>> wrote:
> >>>>>>>> > > >>>
> >>>>>>>> > > >>> Thanks all for the feedback and discussion.
> >>>>>>>> > > >>>
> >>>>>>>> > > >>> Since there wasn't any concern raised, I've started the
> >>>>>>>> voting thread
> >>>>>>>> > > for
> >>>>>>>> > > >>> this FLIP, but please feel free to continue the discussion
> >>>>>>>> here if
> >>>>>>>> > you
> >>>>>>>> > > >>> think something still needs to be addressed.
> >>>>>>>> > > >>>
> >>>>>>>> > > >>> Thanks,
> >>>>>>>> > > >>>
> >>>>>>>> > > >>> Jiangjie (Becket) Qin
> >>>>>>>> > > >>>
> >>>>>>>> > > >>>
> >>>>>>>> > > >>>
> >>>>>>>> > > >>> On Mon, Jun 10, 2019 at 9:10 AM Becket Qin <
> >>>>>>>> becket....@gmail.com>
> >>>>>>>> > > wrote:
> >>>>>>>> > > >>>
> >>>>>>>> > > >>>> Hi Piotr,
> >>>>>>>> > > >>>>
> >>>>>>>> > > >>>> Thanks for the comments. Yes, you are right. Users will
> >>>>>>>> have to look
> >>>>>>>> > > at
> >>>>>>>> > > >>>> other metrics to decide whether the pipeline is healthy
> or
> >>>>>>>> not in
> >>>>>>>> > the
> >>>>>>>> > > >> first
> >>>>>>>> > > >>>> place before they can use the time-based metric to fix
> the
> >>>>>>>> > bottleneck.
> >>>>>>>> > > >>>>
> >>>>>>>> > > >>>> I agree that once we have FLIP-27 ready, some of the
> >>>>>>>> metrics can
> >>>>>>>> > just
> >>>>>>>> > > be
> >>>>>>>> > > >>>> reported by the abstract implementation.
> >>>>>>>> > > >>>>
> >>>>>>>> > > >>>> I've updated FLIP-33 wiki page to add the pendingBytes
> and
> >>>>>>>> > > >> pendingRecords
> >>>>>>>> > > >>>> metric. Please let me know if you have any concern over
> >>>>>>>> the updated
> >>>>>>>> > > >> metric
> >>>>>>>> > > >>>> convention proposal.
> >>>>>>>> > > >>>>
> >>>>>>>> > > >>>> @Chesnay Schepler <ches...@apache.org> @Stephan Ewen
> >>>>>>>> > > >>>> <step...@ververica.com> will you also have time to take
> a
> >>>>>>>> look at
> >>>>>>>> > the
> >>>>>>>> > > >>>> proposed metric convention? If there is no further
> concern
> >>>>>>>> I'll
> >>>>>>>> > start
> >>>>>>>> > > a
> >>>>>>>> > > >>>> voting thread for this FLIP in two days.
> >>>>>>>> > > >>>>
> >>>>>>>> > > >>>> Thanks,
> >>>>>>>> > > >>>>
> >>>>>>>> > > >>>> Jiangjie (Becket) Qin
> >>>>>>>> > > >>>>
> >>>>>>>> > > >>>>
> >>>>>>>> > > >>>>
> >>>>>>>> > > >>>> On Wed, Jun 5, 2019 at 6:54 PM Piotr Nowojski <
> >>>>>>>> pi...@ververica.com>
> >>>>>>>> > > >> wrote:
> >>>>>>>> > > >>>>
> >>>>>>>> > > >>>>> Hi Becket,
> >>>>>>>> > > >>>>>
> >>>>>>>> > > >>>>> Thanks for the answer :)
> >>>>>>>> > > >>>>>
> >>>>>>>> > > >>>>>> By time-based metric, I meant the portion of time spent
> >>>>>>>> on
> >>>>>>>> > producing
> >>>>>>>> > > >> the
> >>>>>>>> > > >>>>>> record to downstream. For example, a source connector
> >>>>>>>> can report
> >>>>>>>> > > that
> >>>>>>>> > > >>>>> it's
> >>>>>>>> > > >>>>>> spending 80% of time to emit record to downstream
> >>>>>>>> processing
> >>>>>>>> > > pipeline.
> >>>>>>>> > > >>>>> In
> >>>>>>>> > > >>>>>> another case, a sink connector may report that its
> >>>>>>>> spending 30% of
> >>>>>>>> > > >> time
> >>>>>>>> > > >>>>>> producing the records to the external system.
> >>>>>>>> > > >>>>>>
> >>>>>>>> > > >>>>>> This is in some sense equivalent to the buffer usage
> >>>>>>>> metric:
> >>>>>>>> > > >>>>>
> >>>>>>>> > > >>>>>> - 80% of time spent on emitting records to downstream
> >>>>>>>> --->
> >>>>>>>> > > downstream
> >>>>>>>> > > >>>>>> node is bottleneck ---> output buffer is probably full.
> >>>>>>>> > > >>>>>> - 30% of time spent on emitting records to downstream
> >>>>>>>> --->
> >>>>>>>> > > downstream
> >>>>>>>> > > >>>>>> node is not bottleneck ---> output buffer is probably
> >>>>>>>> not full.
> >>>>>>>> > > >>>>>
> >>>>>>>> > > >>>>> If by “time spent on emitting records to downstream” you
> >>>>>>>> understand
> >>>>>>>> > > >>>>> “waiting on back pressure”, then I see your point. And I
> >>>>>>>> agree that
> >>>>>>>> > > >> some
> >>>>>>>> > > >>>>> kind of ratio/time based metric gives you more
> >>>>>>>> information. However
> >>>>>>>> > > >> under
> >>>>>>>> > > >>>>> “time spent on emitting records to downstream” might be
> >>>>>>>> hidden the
> >>>>>>>> > > >>>>> following (extreme) situation:
> >>>>>>>> > > >>>>>
> >>>>>>>> > > >>>>> 1. Job is barely able to handle influx of records, there
> >>>>>>>> is 99%
> >>>>>>>> > > >>>>> CPU/resource usage in the cluster, but nobody is
> >>>>>>>> > > >>>>> bottlenecked/backpressured, all output buffers are
> empty,
> >>>>>>>> everybody
> >>>>>>>> > > is
> >>>>>>>> > > >>>>> waiting in 1% of it’s time for more records to process.
> >>>>>>>> > > >>>>> 2. 80% time can still be spent on "down stream
> >>>>>>>> operators”, because
> >>>>>>>> > > they
> >>>>>>>> > > >>>>> are the CPU intensive operations, but this doesn’t mean
> >>>>>>>> that
> >>>>>>>> > > >> increasing the
> >>>>>>>> > > >>>>> parallelism down the stream will help with anything
> >>>>>>>> there. To the
> >>>>>>>> > > >> contrary,
> >>>>>>>> > > >>>>> increasing parallelism of the source operator might help
> >>>>>>>> to
> >>>>>>>> > increase
> >>>>>>>> > > >>>>> resource utilisation up to 100%.
> >>>>>>>> > > >>>>>
> >>>>>>>> > > >>>>> However, this “time based/ratio” approach can be
> extended
> >>>>>>>> to
> >>>>>>>> > > in/output
> >>>>>>>> > > >>>>> buffer usage. Besides collecting the information that the
> >>>>>>>> input/output
> >>>>>>>> > > >> buffer is
> >>>>>>>> > > >>>>> full/empty, we can profile how often the buffers are
> >>>>>>>> empty/full.
> >>>>>>>> > If
> >>>>>>>> > > >> output
> >>>>>>>> > > >>>>> buffer is full 1% of times, there is almost no back
> >>>>>>>> pressure. If
> >>>>>>>> > it’s
> >>>>>>>> > > >> full
> >>>>>>>> > > >>>>> 80% of times, there is some back pressure, if it’s full
> >>>>>>>> 99.9% of
> >>>>>>>> > > times,
> >>>>>>>> > > >>>>> there is huge back pressure.
> >>>>>>>> > > >>>>>
> >>>>>>>> > > >>>>> Now for autoscaling you could compare the input & output
> >>>>>>>> buffers
> >>>>>>>> > fill
> >>>>>>>> > > >>>>> ratio:
> >>>>>>>> > > >>>>>
> >>>>>>>> > > >>>>> 1. Both are high, the source of bottleneck is down the
> >>>>>>>> stream
> >>>>>>>> > > >>>>> 2. Output is low, input is high, this is the bottleneck
> >>>>>>>> and the
> >>>>>>>> > > higher
> >>>>>>>> > > >>>>> the difference, the more this operator/task is the source
> >>>>>>>> > > >> of the bottleneck
> >>>>>>>> > > >> operator/task
> >>>>>>>> > > >>>>> 3. Output is high, input is low - there was some load
> >>>>>>>> spike that we
> >>>>>>>> > > are
> >>>>>>>> > > >>>>> currently finishing to process
> >>>>>>>> > > >>>>>
> >>>>>>>> > > >>>>>
> >>>>>>>> > > >>>>>
> >>>>>>>> > > >>>>> But long story short, we are probably diverging from the
> >>>>>>>> topic of
> >>>>>>>> > > this
> >>>>>>>> > > >>>>> discussion, and we can discuss this at some later point.
> >>>>>>>> > > >>>>>
> >>>>>>>> > > >>>>> For now, for sources:
> >>>>>>>> > > >>>>>
> >>>>>>>> > > >>>>> as I wrote before, +1 for:
> >>>>>>>> > > >>>>> - pending.bytes, Gauge
> >>>>>>>> > > >>>>> - pending.messages, Gauge
> >>>>>>>> > > >>>>>
> >>>>>>>> > > >>>>> When we will be developing/discussing SourceReader from
> >>>>>>>> FLIP-27 we
> >>>>>>>> > > >> might
> >>>>>>>> > > >>>>> then add:
> >>>>>>>> > > >>>>>
> >>>>>>>> > > >>>>> - in-memory.buffer.usage (0 - 100%)
> >>>>>>>> > > >>>>>
> >>>>>>>> > > >>>>> Which will be estimated automatically by Flink while
> user
> >>>>>>>> will be
> >>>>>>>> > > able
> >>>>>>>> > > >> to
> >>>>>>>> > > >>>>> override/provide better estimation.
> >>>>>>>> > > >>>>>
> >>>>>>>> > > >>>>> Piotrek
> >>>>>>>> > > >>>>>
> >>>>>>>> > > >>>>>> On 5 Jun 2019, at 05:42, Becket Qin <
> >>>>>>>> becket....@gmail.com> wrote:
> >>>>>>>> > > >>>>>>
> >>>>>>>> > > >>>>>> Hi Piotr,
> >>>>>>>> > > >>>>>>
> >>>>>>>> > > >>>>>> Thanks for the explanation. Please see some
> >>>>>>>> clarifications below.
> >>>>>>>> > > >>>>>>
> >>>>>>>> > > >>>>>> By time-based metric, I meant the portion of time spent
> >>>>>>>> on
> >>>>>>>> > producing
> >>>>>>>> > > >> the
> >>>>>>>> > > >>>>>> record to downstream. For example, a source connector
> >>>>>>>> can report
> >>>>>>>> > > that
> >>>>>>>> > > >>>>> it's
> >>>>>>>> > > >>>>>> spending 80% of time to emit record to downstream
> >>>>>>>> processing
> >>>>>>>> > > pipeline.
> >>>>>>>> > > >>>>> In
> >>>>>>>> > > >>>>>> another case, a sink connector may report that its
> >>>>>>>> spending 30% of
> >>>>>>>> > > >> time
> >>>>>>>> > > >>>>>> producing the records to the external system.
> >>>>>>>> > > >>>>>>
> >>>>>>>> > > >>>>>> This is in some sense equivalent to the buffer usage
> >>>>>>>> metric:
> >>>>>>>> > > >>>>>> - 80% of time spent on emitting records to downstream
> >>>>>>>> --->
> >>>>>>>> > > downstream
> >>>>>>>> > > >>>>>> node is bottleneck ---> output buffer is probably full.
> >>>>>>>> > > >>>>>> - 30% of time spent on emitting records to downstream
> >>>>>>>> --->
> >>>>>>>> > > downstream
> >>>>>>>> > > >>>>>> node is not bottleneck ---> output buffer is probably
> >>>>>>>> not full.
> >>>>>>>> > > >>>>>>
> >>>>>>>> > > >>>>>> However, the time-based metric has a few advantages
> that
> >>>>>>>> the
> >>>>>>>> > buffer
> >>>>>>>> > > >>>>> usage
> >>>>>>>> > > >>>>>> metric may not have.
> >>>>>>>> > > >>>>>>
> >>>>>>>> > > >>>>>> 1.  Buffer usage metric may not be applicable to all
> the
> >>>>>>>> connector
> >>>>>>>> > > >>>>>> implementations, while reporting time-based metric are
> >>>>>>>> always
> >>>>>>>> > > doable.
> >>>>>>>> > > >>>>>> Some source connectors may not have any input buffer,
> or
> >>>>>>>> they may
> >>>>>>>> > > use
> >>>>>>>> > > >>>>> some
> >>>>>>>> > > >>>>>> third party library that does not expose the input
> >>>>>>>> buffer at all.
> >>>>>>>> > > >>>>>> Similarly, for sink connectors, the implementation may
> >>>>>>>> not have
> >>>>>>>> > any
> >>>>>>>> > > >>>>> output
> >>>>>>>> > > >>>>>> buffer, or the third party library does not expose such
> >>>>>>>> buffer.
> >>>>>>>> > > >>>>>>
> >>>>>>>> > > >>>>>> 2. Although both type of metrics can detect bottleneck,
> >>>>>>>> time-based
> >>>>>>>> > > >>>>> metrics
> >>>>>>>> > > >>>>>> can be used to generate a more informed action to
> remove
> >>>>>>>> the
> >>>>>>>> > > >> bottleneck.
> >>>>>>>> > > >>>>>> For example, when the downstream is bottleneck, the
> >>>>>>>> output buffer
> >>>>>>>> > > >> usage
> >>>>>>>> > > >>>>>> metric is likely to be 100%, and the input buffer usage
> >>>>>>>> might be
> >>>>>>>> > 0%.
> >>>>>>>> > > >>>>> That
> >>>>>>>> > > >>>>>> means we don't know what is the suitable parallelism to
> >>>>>>>> lift the
> >>>>>>>> > > >>>>>> bottleneck. The time-based metric, on the other hand,
> >>>>>>>> would give
> >>>>>>>> > > >> useful
> >>>>>>>> > > >>>>>> information, e.g. if 80% of time was spent on emitting
> >>>>>>>> records, we
> >>>>>>>> > > can
> >>>>>>>> > > >>>>>> roughly increase the downstream node parallelism by 4
> >>>>>>>> times.
> >>>>>>>> > > >>>>>>
> >>>>>>>> > > >>>>>> Admittedly, the time-based metrics are more expensive
> >>>>>>>> than buffer
> >>>>>>>> > > >>>>> usage. So
> >>>>>>>> > > >>>>>> we may have to do some sampling to reduce the cost. But
> >>>>>>>> in
> >>>>>>>> > general,
> >>>>>>>> > > >>>>> using
> >>>>>>>> > > >>>>>> time-based metrics seems worth adding.
> >>>>>>>> > > >>>>>>
> >>>>>>>> > > >>>>>> That being said, I don't think buffer usage metric and
> >>>>>>>> time-based
> >>>>>>>> > > >>>>> metrics
> >>>>>>>> > > >>>>>> are mutually exclusive. We can probably have both. It
> is
> >>>>>>>> just that
> >>>>>>>> > > in
> >>>>>>>> > > >>>>>> practice, features like auto-scaling might prefer
> >>>>>>>> time-based
> >>>>>>>> > metrics
> >>>>>>>> > > >> for
> >>>>>>>> > > >>>>>> the reason stated above.
> >>>>>>>> > > >>>>>>
> >>>>>>>> > > >>>>>>> 1. Define the metrics that would allow us to manually
> >>>>>>>> detect
> >>>>>>>> > > >>>>> bottlenecks.
> >>>>>>>> > > >>>>>> As I wrote, we already have them in most of the places,
> >>>>>>>> except of
> >>>>>>>> > > >>>>>> sources/sinks.
> >>>>>>>> > > >>>>>>> 2. Use those metrics, to automatically detect
> >>>>>>>> bottlenecks.
> >>>>>>>> > > Currently
> >>>>>>>> > > >> we
> >>>>>>>> > > >>>>>> are only automatically detecting back pressure and
> >>>>>>>> reporting it to
> >>>>>>>> > > the
> >>>>>>>> > > >>>>> user
> >>>>>>>> > > >>>>>> in web UI (is it exposed as a metric at all?).
> Detecting
> >>>>>>>> the root
> >>>>>>>> > > >> cause
> >>>>>>>> > > >>>>> of
> >>>>>>>> > > >>>>>> the back pressure (bottleneck) is one step further.
> >>>>>>>> > > >>>>>>> 3. Use the knowledge about where exactly the
> bottleneck
> >>>>>>>> is
> >>>>>>>> > located,
> >>>>>>>> > > >> to
> >>>>>>>> > > >>>>>> try to do something with it.
> >>>>>>>> > > >>>>>>
> >>>>>>>> > > >>>>>> As explained above, I think time-based metric also
> >>>>>>>> addresses item
> >>>>>>>> > 1
> >>>>>>>> > > >> and
> >>>>>>>> > > >>>>>> item 2.
> >>>>>>>> > > >>>>>>
> >>>>>>>> > > >>>>>> Any thoughts?
> >>>>>>>> > > >>>>>>
> >>>>>>>> > > >>>>>> Thanks,
> >>>>>>>> > > >>>>>>
> >>>>>>>> > > >>>>>> Jiangjie (Becket) Qin
> >>>>>>>> > > >>>>>>
> >>>>>>>> > > >>>>>>
> >>>>>>>> > > >>>>>>
> >>>>>>>> > > >>>>>> On Mon, Jun 3, 2019 at 4:14 PM Piotr Nowojski <
> >>>>>>>> > pi...@ververica.com>
> >>>>>>>> > > >>>>> wrote:
> >>>>>>>> > > >>>>>>
> >>>>>>>> > > >>>>>>> Hi again :)
> >>>>>>>> > > >>>>>>>
> >>>>>>>> > > >>>>>>>> - pending.bytes, Gauge
> >>>>>>>> > > >>>>>>>> - pending.messages, Gauge
> >>>>>>>> > > >>>>>>>
> >>>>>>>> > > >>>>>>>
> >>>>>>>> > > >>>>>>> +1
> >>>>>>>> > > >>>>>>>
> >>>>>>>> > > >>>>>>> And true, instead of overloading one of the metric it
> >>>>>>>> is better
> >>>>>>>> > > when
> >>>>>>>> > > >>>>> user
> >>>>>>>> > > >>>>>>> can choose to provide only one of them.
> >>>>>>>> > > >>>>>>>
> >>>>>>>> > > >>>>>>> Re 2:
> >>>>>>>> > > >>>>>>>
> >>>>>>>> > > >>>>>>>> If I understand correctly, this metric along with the
> >>>>>>>> pending
> >>>>>>>> > > >> mesages
> >>>>>>>> > > >>>>> /
> >>>>>>>> > > >>>>>>>> bytes would answer the questions of:
> >>>>>>>> > > >>>>>>>
> >>>>>>>> > > >>>>>>>> - Does the connector consume fast enough? Lagging
> >>>>>>>> behind + empty
> >>>>>>>> > > >>>>> buffer
> >>>>>>>> > > >>>>>>> =
> >>>>>>>> > > >>>>>>>> cannot consume fast enough.
> >>>>>>>> > > >>>>>>>> - Does the connector emit fast enough? Lagging behind
> >>>>>>>> + full
> >>>>>>>> > > buffer
> >>>>>>>> > > >> =
> >>>>>>>> > > >>>>>>>> cannot emit fast enough, i.e. the Flink pipeline is
> >>>>>>>> slow.
> >>>>>>>> > > >>>>>>>
> >>>>>>>> > > >>>>>>> Yes, exactly. This can also be used to support
> >>>>>>>> decisions like
> >>>>>>>> > > >> changing
> >>>>>>>> > > >>>>> the
> >>>>>>>> > > >>>>>>> parallelism of the sources and/or down stream
> operators.
> >>>>>>>> > > >>>>>>>
> >>>>>>>> > > >>>>>>> I’m not sure if I understand your proposal with time
> >>>>>>>> based
> >>>>>>>> > > >>>>> measurements.
> >>>>>>>> > > >>>>>>> Maybe I’m missing something, but I do not see how
> >>>>>>>> measuring time
> >>>>>>>> > > >> alone
> >>>>>>>> > > >>>>>>> could answer the problem: where is the bottleneck.
> Time
> >>>>>>>> spent on
> >>>>>>>> > > the
> >>>>>>>> > > >>>>>>> next/emit might be short or long (depending on how
> >>>>>>>> heavy to
> >>>>>>>> > process
> >>>>>>>> > > >> the
> >>>>>>>> > > >>>>>>> record is) and the source can still be
> bottlenecked/back
> >>>>>>>> > pressured
> >>>>>>>> > > or
> >>>>>>>> > > >>>>> not.
> >>>>>>>> > > >>>>>>> Usually the easiest and the most reliable way how to
> >>>>>>>> detect
> >>>>>>>> > > >>>>> bottlenecks is
> >>>>>>>> > > >>>>>>> by checking usage of input & output buffers, since
> when
> >>>>>>>> input
> >>>>>>>> > > buffer
> >>>>>>>> > > >> is
> >>>>>>>> > > >>>>>>> full while output buffer is empty, that’s the
> >>>>>>>> definition of a
> >>>>>>>> > > >>>>> bottleneck.
> >>>>>>>> > > >>>>>>> Also this is usually very easy and cheap to measure
> (it
> >>>>>>>> works
> >>>>>>>> > > >>>>> effectively
> >>>>>>>> > > >>>>>>> the same way as current’s Flink back pressure
> >>>>>>>> monitoring, but
> >>>>>>>> > more
> >>>>>>>> > > >>>>> cleanly,
> >>>>>>>> > > >>>>>>> without probing thread’s stack traces).
> >>>>>>>> > > >>>>>>>
> >>>>>>>> > > >>>>>>> Also keep in mind that we are already using the buffer
> >>>>>>>> usage
> >>>>>>>> > > metrics
> >>>>>>>> > > >>>>> for
> >>>>>>>> > > >>>>>>> detecting the bottlenecks in Flink’s internal network
> >>>>>>>> exchanges
> >>>>>>>> > > >> (manual
> >>>>>>>> > > >>>>>>> work). That’s the reason why I wanted to extend this
> to
> >>>>>>>> > > >> sources/sinks,
> >>>>>>>> > > >>>>>>> since they are currently our blind spot.
> >>>>>>>> > > >>>>>>>
> >>>>>>>> > > >>>>>>>> One feature we are currently working on to scale
> Flink
> >>>>>>>> > > automatically
> >>>>>>>> > > >>>>>>> relies
> >>>>>>>> > > >>>>>>>> on some metrics answering the same question
> >>>>>>>> > > >>>>>>>
> >>>>>>>> > > >>>>>>> That would be very helpful feature. I think in order
> to
> >>>>>>>> achieve
> >>>>>>>> > > that
> >>>>>>>> > > >> we
> >>>>>>>> > > >>>>>>> would need to:
> >>>>>>>> > > >>>>>>> 1. Define the metrics that would allow us to manually
> >>>>>>>> detect
> >>>>>>>> > > >>>>> bottlenecks.
> >>>>>>>> > > >>>>>>> As I wrote, we already have them in most of the
> places,
> >>>>>>>> except of
> >>>>>>>> > > >>>>>>> sources/sinks.
> >>>>>>>> > > >>>>>>> 2. Use those metrics, to automatically detect
> >>>>>>>> bottlenecks.
> >>>>>>>> > > Currently
> >>>>>>>> > > >> we
> >>>>>>>> > > >>>>>>> are only automatically detecting back pressure and
> >>>>>>>> reporting it
> >>>>>>>> > to
> >>>>>>>> > > >> the
> >>>>>>>> > > >>>>> user
> >>>>>>>> > > >>>>>>> in web UI (is it exposed as a metric at all?).
> >>>>>>>> Detecting the root
> >>>>>>>> > > >>>>> cause of
> >>>>>>>> > > >>>>>>> the back pressure (bottleneck) is one step further.
> >>>>>>>> > > >>>>>>> 3. Use the knowledge about where exactly the
> bottleneck
> >>>>>>>> is
> >>>>>>>> > located,
> >>>>>>>> > > >> to
> >>>>>>>> > > >>>>> try
> >>>>>>>> > > >>>>>>> to do something with it.
> >>>>>>>> > > >>>>>>>
> >>>>>>>> > > >>>>>>> I think you are aiming for point 3., but before we
> >>>>>>>> reach it, we
> >>>>>>>> > are
> >>>>>>>> > > >>>>> still
> >>>>>>>> > > >>>>>>> missing 1. & 2. Also even if we have 3., there is a
> >>>>>>>> value in 1 &
> >>>>>>>> > 2
> >>>>>>>> > > >> for
> >>>>>>>> > > >>>>>>> manual analysis/dashboards.
> >>>>>>>> > > >>>>>>>
> >>>>>>>> > > >>>>>>> However, having the knowledge of where the bottleneck
> >>>>>>>> is, doesn’t
> >>>>>>>> > > >>>>>>> necessarily mean that we know what we can do about it.
> >>>>>>>> For
> >>>>>>>> > example
> >>>>>>>> > > >>>>>>> increasing parallelism might or might not help with
> >>>>>>>> anything
> >>>>>>>> > (data
> >>>>>>>> > > >>>>> skew,
> >>>>>>>> > > >>>>>>> bottleneck on some resource that does not scale), but
> >>>>>>>> this remark
> >>>>>>>> > > >>>>> applies
> >>>>>>>> > > >>>>>>> always, regardless of the way how did we detect the
> >>>>>>>> bottleneck.
> >>>>>>>> > > >>>>>>>
> >>>>>>>> > > >>>>>>> Piotrek
> >>>>>>>> > > >>>>>>>
> >>>>>>>> > > >>>>>>>> On 3 Jun 2019, at 06:16, Becket Qin <
> >>>>>>>> becket....@gmail.com>
> >>>>>>>> > wrote:
> >>>>>>>> > > >>>>>>>>
> >>>>>>>> > > >>>>>>>> Hi Piotr,
> >>>>>>>> > > >>>>>>>>
> >>>>>>>> > > >>>>>>>> Thanks for the suggestion. Some thoughts below:
> >>>>>>>> > > >>>>>>>>
> >>>>>>>> > > >>>>>>>> Re 1: The pending messages / bytes.
> >>>>>>>> > > >>>>>>>> I completely agree these are very useful metrics and
> >>>>>>>> we should
> >>>>>>>> > > >> expect
> >>>>>>>> > > >>>>> the
> >>>>>>>> > > >>>>>>>> connector to report. WRT the way to expose them, it
> >>>>>>>> seems more
> >>>>>>>> > > >>>>> consistent
> >>>>>>>> > > >>>>>>>> to add two metrics instead of adding a method (unless
> >>>>>>>> there are
> >>>>>>>> > > >> other
> >>>>>>>> > > >>>>> use
> >>>>>>>> > > >>>>>>>> cases other than metric reporting). So we can add the
> >>>>>>>> following
> >>>>>>>> > > two
> >>>>>>>> > > >>>>>>> metrics.
> >>>>>>>> > > >>>>>>>> - pending.bytes, Gauge
> >>>>>>>> > > >>>>>>>> - pending.messages, Gauge
> >>>>>>>> > > >>>>>>>> Applicable connectors can choose to report them.
> These
> >>>>>>>> two
> >>>>>>>> > metrics
> >>>>>>>> > > >>>>> along
> >>>>>>>> > > >>>>>>>> with latency should be sufficient for users to
> >>>>>>>> understand the
> >>>>>>>> > > >> progress
> >>>>>>>> > > >>>>>>> of a
> >>>>>>>> > > >>>>>>>> connector.
> >>>>>>>> > > >>>>>>>>
> >>>>>>>> > > >>>>>>>>
> >>>>>>>> > > >>>>>>>> Re 2: Number of buffered data in-memory of the
> >>>>>>>> connector
> >>>>>>>> > > >>>>>>>> If I understand correctly, this metric along with the
> >>>>>>>> pending
> >>>>>>>> > > >> mesages
> >>>>>>>> > > >>>>> /
> >>>>>>>> > > >>>>>>>> bytes would answer the questions of:
> >>>>>>>> > > >>>>>>>> - Does the connector consume fast enough? Lagging
> >>>>>>>> behind + empty
> >>>>>>>> > > >>>>> buffer
> >>>>>>>> > > >>>>>>> =
> >>>>>>>> > > >>>>>>>> cannot consume fast enough.
> >>>>>>>> > > >>>>>>>> - Does the connector emit fast enough? Lagging behind
> >>>>>>>> + full
> >>>>>>>> > > buffer
> >>>>>>>> > > >> =
> >>>>>>>> > > >>>>>>>> cannot emit fast enough, i.e. the Flink pipeline is
> >>>>>>>> slow.
> >>>>>>>> > > >>>>>>>>
> >>>>>>>> > > >>>>>>>> One feature we are currently working on to scale
> Flink
> >>>>>>>> > > automatically
> >>>>>>>> > > >>>>>>> relies
> >>>>>>>> > > >>>>>>>> on some metrics answering the same question, more
> >>>>>>>> specifically,
> >>>>>>>> > we
> >>>>>>>> > > >> are
> >>>>>>>> > > >>>>>>>> profiling the time spent on .next() method (time to
> >>>>>>>> consume) and
> >>>>>>>> > > the
> >>>>>>>> > > >>>>> time
> >>>>>>>> > > >>>>>>>> spent on .collect() method (time to emit / process).
> >>>>>>>> One
> >>>>>>>> > advantage
> >>>>>>>> > > >> of
> >>>>>>>> > > >>>>>>> such
> >>>>>>>> > > >>>>>>>> method level time cost allows us to calculate the
> >>>>>>>> parallelism
> >>>>>>>> > > >>>>> required to
> >>>>>>>> > > >>>>>>>> keep up in case there is a lag.
> >>>>>>>> > > >>>>>>>>
> >>>>>>>> > > >>>>>>>> However, one concern I have regarding such metric is
> >>>>>>>> that they
> >>>>>>>> > are
> >>>>>>>> > > >>>>>>>> implementation specific. Either profiling on the
> >>>>>>>> method time, or
> >>>>>>>> > > >>>>>>> reporting
> >>>>>>>> > > >>>>>>>> buffer usage assumes the connector are implemented in
> >>>>>>>> such a
> >>>>>>>> > way.
> >>>>>>>> > > A
> >>>>>>>> > > >>>>>>>> slightly better solution might be have the following
> >>>>>>>> metric:
> >>>>>>>> > > >>>>>>>>
> >>>>>>>> > > >>>>>>>>  - EmitTimeRatio (or FetchTimeRatio): The time spent
> >>>>>>>> on emitting
> >>>>>>>> > > >>>>>>>> records / Total time elapsed.
> >>>>>>>> > > >>>>>>>>
> >>>>>>>> > > >>>>>>>> This assumes that the source connectors have to emit
> >>>>>>>> the records
> >>>>>>>> > > to
> >>>>>>>> > > >>>>> the
> >>>>>>>> > > >>>>>>>> downstream at some point. The emission may take some
> >>>>>>>> time ( e.g.
> >>>>>>>> > > go
> >>>>>>>> > > >>>>>>> through
> >>>>>>>> > > >>>>>>>> chained operators). And the rest of the time are
> spent
> >>>>>>>> to
> >>>>>>>> > prepare
> >>>>>>>> > > >> the
> >>>>>>>> > > >>>>>>>> record to emit, including time for consuming and
> format
> >>>>>>>> > > conversion,
> >>>>>>>> > > >>>>> etc.
> >>>>>>>> > > >>>>>>>> Ideally, we'd like to see the time spent on record
> >>>>>>>> fetch and
> >>>>>>>> > emit
> >>>>>>>> > > to
> >>>>>>>> > > >>>>> be
> >>>>>>>> > > >>>>>>>> about the same, so no one is bottleneck for the
> other.
> >>>>>>>> > > >>>>>>>>
> >>>>>>>> > > >>>>>>>> The downside of these time based metrics is
> additional
> >>>>>>>> overhead
> >>>>>>>> > to
> >>>>>>>> > > >> get
> >>>>>>>> > > >>>>>>> the
> >>>>>>>> > > >>>>>>>> time, therefore sampling might be needed. But in
> >>>>>>>> practice I feel
> >>>>>>>> > > >> such
> >>>>>>>> > > >>>>>>> time
> >>>>>>>> > > >>>>>>>> based metric might be more useful if we want to take
> >>>>>>>> action.
> >>>>>>>> > > >>>>>>>>
> >>>>>>>> > > >>>>>>>>
> >>>>>>>> > > >>>>>>>> I think we should absolutely add metrics in (1) to
> the
> >>>>>>>> metric
> >>>>>>>> > > >>>>> convention.
> >>>>>>>> > > >>>>>>>> We could also add the metrics mentioned in (2) if we
> >>>>>>>> reach
> >>>>>>>> > > consensus
> >>>>>>>> > > >>>>> on
> >>>>>>>> > > >>>>>>>> that. What do you think?
> >>>>>>>> > > >>>>>>>>
> >>>>>>>> > > >>>>>>>> Thanks,
> >>>>>>>> > > >>>>>>>>
> >>>>>>>> > > >>>>>>>> Jiangjie (Becket) Qin
> >>>>>>>> > > >>>>>>>>
> >>>>>>>> > > >>>>>>>>
> >>>>>>>> > > >>>>>>>> On Fri, May 31, 2019 at 4:26 PM Piotr Nowojski <
> >>>>>>>> > > pi...@ververica.com
> >>>>>>>> > > >>>
> >>>>>>>> > > >>>>>>> wrote:
> >>>>>>>> > > >>>>>>>>
> >>>>>>>> > > >>>>>>>>> Hey Becket,
> >>>>>>>> > > >>>>>>>>>
> >>>>>>>> > > >>>>>>>>> Re 1a) and 1b) +1 from my side.
> >>>>>>>> > > >>>>>>>>>
> >>>>>>>> > > >>>>>>>>> I’ve discussed this issue:
> >>>>>>>> > > >>>>>>>>>
> >>>>>>>> > > >>>>>>>>>>>>
> >>>>>>>> > > >>>>>>>>>>>> 2. It would be nice to have metrics, that allow
> us
> >>>>>>>> to check
> >>>>>>>> > > the
> >>>>>>>> > > >>>>> cause
> >>>>>>>> > > >>>>>>>>> of
> >>>>>>>> > > >>>>>>>>>>>> back pressure:
> >>>>>>>> > > >>>>>>>>>>>> a) for sources, length of input queue (in bytes?
> >>>>>>>> Or boolean
> >>>>>>>> > > >>>>>>>>>>>> hasSomethingl/isEmpty)
> >>>>>>>> > > >>>>>>>>>>>> b) for sinks, length of output queue (in bytes?
> Or
> >>>>>>>> boolean
> >>>>>>>> > > >>>>>>>>>>>> hasSomething/isEmpty)
> >>>>>>>> > > >>>>>>>>>
> >>>>>>>> > > >>>>>>>>> With Nico at some lengths and he also saw the
> >>>>>>>> benefits of them.
> >>>>>>>> > > We
> >>>>>>>> > > >>>>> also
> >>>>>>>> > > >>>>>>>>> have more concrete proposal for that.
> >>>>>>>> > > >>>>>>>>>
> >>>>>>>> > > >>>>>>>>> Actually there are two really useful metrics, that
> we
> >>>>>>>> are
> >>>>>>>> > missing
> >>>>>>>> > > >>>>>>>>> currently:
> >>>>>>>> > > >>>>>>>>>
> >>>>>>>> > > >>>>>>>>> 1. Number of data/records/bytes in the backlog to
> >>>>>>>> process. For
> >>>>>>>> > > >>>>> example
> >>>>>>>> > > >>>>>>>>> remaining number of bytes in unread files. Or
> pending
> >>>>>>>> data in
> >>>>>>>> > > Kafka
> >>>>>>>> > > >>>>>>> topics.
> >>>>>>>> > > >>>>>>>>> 2. Number of buffered data in-memory of the
> >>>>>>>> connector, that are
> >>>>>>>> > > >>>>> waiting
> >>>>>>>> > > >>>>>>> to
> >>>>>>>> > > >>>>>>>>> be processed pushed to Flink pipeline.
> >>>>>>>> > > >>>>>>>>>
> >>>>>>>> > > >>>>>>>>> Re 1:
> >>>>>>>> > > >>>>>>>>> This would have to be a metric provided directly by
> a
> >>>>>>>> > connector.
> >>>>>>>> > > It
> >>>>>>>> > > >>>>>>> could
> >>>>>>>> > > >>>>>>>>> be an undefined `int`:
> >>>>>>>> > > >>>>>>>>>
> >>>>>>>> > > >>>>>>>>> `int backlog` - estimate of pending work.
> >>>>>>>> > > >>>>>>>>>
> >>>>>>>> > > >>>>>>>>> “Undefined” meaning that it would be up to a
> >>>>>>>> connector to
> >>>>>>>> > decide
> >>>>>>>> > > >>>>>>> whether
> >>>>>>>> > > >>>>>>>>> it’s measured in bytes, records, pending files or
> >>>>>>>> whatever it
> >>>>>>>> > is
> >>>>>>>> > > >>>>>>> possible
> >>>>>>>> > > >>>>>>>>> to provide by the connector. This is because I
> assume
> >>>>>>>> not every
> >>>>>>>> > > >>>>>>> connector
> >>>>>>>> > > >>>>>>>>> can provide exact number and for some of them it
> >>>>>>>> might be
> >>>>>>>> > > >> impossible
> >>>>>>>> > > >>>>> to
> >>>>>>>> > > >>>>>>>>> provide records number of bytes count.
> >>>>>>>> > > >>>>>>>>>
> >>>>>>>> > > >>>>>>>>> Re 2:
> >>>>>>>> > > >>>>>>>>> This metric could be either provided by a connector,
> >>>>>>>> or
> >>>>>>>> > > calculated
> >>>>>>>> > > >>>>>>> crudely
> >>>>>>>> > > >>>>>>>>> by Flink:
> >>>>>>>> > > >>>>>>>>>
> >>>>>>>> > > >>>>>>>>> `float bufferUsage` - value from [0.0, 1.0] range
> >>>>>>>> > > >>>>>>>>>
> >>>>>>>> > > >>>>>>>>> Percentage of used in memory buffers, like in
> Kafka’s
> >>>>>>>> handover.
> >>>>>>>> > > >>>>>>>>>
> >>>>>>>> > > >>>>>>>>> It could be crudely implemented by Flink with
> FLIP-27
> >>>>>>>> > > >>>>>>>>> SourceReader#isAvailable. If SourceReader is not
> >>>>>>>> available
> >>>>>>>> > > reported
> >>>>>>>> > > >>>>>>>>> `bufferUsage` could be 0.0. If it is available, it
> >>>>>>>> could be
> >>>>>>>> > 1.0.
> >>>>>>>> > > I
> >>>>>>>> > > >>>>> think
> >>>>>>>> > > >>>>>>>>> this would be a good enough estimation for most of
> >>>>>>>> the use
> >>>>>>>> > cases
> >>>>>>>> > > >>>>> (that
> >>>>>>>> > > >>>>>>>>> could be overloaded and implemented better if
> >>>>>>>> desired).
> >>>>>>>> > > Especially
> >>>>>>>> > > >>>>>>> since we
> >>>>>>>> > > >>>>>>>>> are reporting only probed values: if probed values
> >>>>>>>> are almost
> >>>>>>>> > > >> always
> >>>>>>>> > > >>>>>>> “1.0”,
> >>>>>>>> > > >>>>>>>>> it would mean that we have a back pressure. If they
> >>>>>>>> are almost
> >>>>>>>> > > >> always
> >>>>>>>> > > >>>>>>>>> “0.0”, there is probably no back pressure at the
> >>>>>>>> sources.
> >>>>>>>> > > >>>>>>>>>
> >>>>>>>> > > >>>>>>>>> What do you think about this?
> >>>>>>>> > > >>>>>>>>>
> >>>>>>>> > > >>>>>>>>> Piotrek
> >>>>>>>> > > >>>>>>>>>
> >>>>>>>> > > >>>>>>>>>> On 30 May 2019, at 11:41, Becket Qin <
> >>>>>>>> becket....@gmail.com>
> >>>>>>>> > > >> wrote:
> >>>>>>>> > > >>>>>>>>>>
> >>>>>>>> > > >>>>>>>>>> Hi all,
> >>>>>>>> > > >>>>>>>>>>
> >>>>>>>> > > >>>>>>>>>> Thanks a lot for all the feedback and comments. I'd
> >>>>>>>> like to
> >>>>>>>> > > >> continue
> >>>>>>>> > > >>>>>>> the
> >>>>>>>> > > >>>>>>>>>> discussion on this FLIP.
> >>>>>>>> > > >>>>>>>>>>
> >>>>>>>> > > >>>>>>>>>> I updated the FLIP-33 wiki to remove all the
> >>>>>>>> histogram metrics
> >>>>>>>> > > >> from
> >>>>>>>> > > >>>>> the
> >>>>>>>> > > >>>>>>>>>> first version of metric convention due to the
> >>>>>>>> performance
> >>>>>>>> > > concern.
> >>>>>>>> > > >>>>> The
> >>>>>>>> > > >>>>>>>>> plan
> >>>>>>>> > > >>>>>>>>>> is to introduce them later when we have a mechanism
> >>>>>>>> to opt
> >>>>>>>> > > in/out
> >>>>>>>> > > >>>>>>>>> metrics.
> >>>>>>>> > > >>>>>>>>>> At that point, users can decide whether they want
> to
> >>>>>>>> pay the
> >>>>>>>> > > cost
> >>>>>>>> > > >> to
> >>>>>>>> > > >>>>>>> get
> >>>>>>>> > > >>>>>>>>>> the metric or not.
> >>>>>>>> > > >>>>>>>>>>
> >>>>>>>> > > >>>>>>>>>> As Stephan suggested, for this FLIP, let's first
> try
> >>>>>>>> to agree
> >>>>>>>> > on
> >>>>>>>> > > >> the
> >>>>>>>> > > >>>>>>>>> small
> >>>>>>>> > > >>>>>>>>>> list of conventional metrics that connectors should
> >>>>>>>> follow.
> >>>>>>>> > > >>>>>>>>>> Just to be clear, the purpose of the convention is
> >>>>>>>> not to
> >>>>>>>> > > enforce
> >>>>>>>> > > >>>>> every
> >>>>>>>> > > >>>>>>>>>> connector to report all these metrics, but to
> >>>>>>>> provide a
> >>>>>>>> > guidance
> >>>>>>>> > > >> in
> >>>>>>>> > > >>>>>>> case
> >>>>>>>> > > >>>>>>>>>> these metrics are reported by some connectors.
> >>>>>>>> > > >>>>>>>>>>
> >>>>>>>> > > >>>>>>>>>>
> >>>>>>>> > > >>>>>>>>>> @ Stephan & Chesnay,
> >>>>>>>> > > >>>>>>>>>>
> >>>>>>>> > > >>>>>>>>>> Regarding the duplication of `RecordsIn` metric in
> >>>>>>>> operator /
> >>>>>>>> > > task
> >>>>>>>> > > >>>>>>>>>> IOMetricGroups, from what I understand, for source
> >>>>>>>> operator,
> >>>>>>>> > it
> >>>>>>>> > > is
> >>>>>>>> > > >>>>>>>>> actually
> >>>>>>>> > > >>>>>>>>>> the SourceFunction that reports the operator level
> >>>>>>>> > > >>>>>>>>>> RecordsIn/RecordsInPerSecond metric. So they are
> >>>>>>>> essentially
> >>>>>>>> > the
> >>>>>>>> > > >>>>> same
> >>>>>>>> > > >>>>>>>>>> metric in the operator level IOMetricGroup.
> >>>>>>>> Similarly for the
> >>>>>>>> > > Sink
> >>>>>>>> > > >>>>>>>>>> operator, the operator level
> >>>>>>>> RecordsOut/RecordsOutPerSecond
> >>>>>>>> > > >> metrics
> >>>>>>>> > > >>>>> are
> >>>>>>>> > > >>>>>>>>>> also reported by the Sink function. I marked them
> as
> >>>>>>>> existing
> >>>>>>>> > in
> >>>>>>>> > > >> the
> >>>>>>>> > > >>>>>>>>>> FLIP-33 wiki page. Please let me know if I
> >>>>>>>> misunderstood.
> >>>>>>>> > > >>>>>>>>>>
> >>>>>>>> > > >>>>>>>>>> Thanks,
> >>>>>>>> > > >>>>>>>>>>
> >>>>>>>> > > >>>>>>>>>> Jiangjie (Becket) Qin
> >>>>>>>> > > >>>>>>>>>>
> >>>>>>>> > > >>>>>>>>>>
> >>>>>>>> > > >>>>>>>>>> On Thu, May 30, 2019 at 5:16 PM Becket Qin <
> >>>>>>>> > > becket....@gmail.com>
> >>>>>>>> > > >>>>>>> wrote:
> >>>>>>>> > > >>>>>>>>>>
> >>>>>>>> > > >>>>>>>>>>> Hi Piotr,
> >>>>>>>> > > >>>>>>>>>>>
> >>>>>>>> > > >>>>>>>>>>> Thanks a lot for the feedback.
> >>>>>>>> > > >>>>>>>>>>>
> >>>>>>>> > > >>>>>>>>>>> 1a) I guess you are referring to the part that
> >>>>>>>> "original
> >>>>>>>> > system
> >>>>>>>> > > >>>>>>> specific
> >>>>>>>> > > >>>>>>>>>>> metrics should also be reported". The performance
> >>>>>>>> impact
> >>>>>>>> > > depends
> >>>>>>>> > > >> on
> >>>>>>>> > > >>>>>>> the
> >>>>>>>> > > >>>>>>>>>>> implementation. An efficient implementation would
> >>>>>>>> only record
> >>>>>>>> > > the
> >>>>>>>> > > >>>>>>> metric
> >>>>>>>> > > >>>>>>>>>>> once, but report them with two different metric
> >>>>>>>> names. This
> >>>>>>>> > is
> >>>>>>>> > > >>>>>>> unlikely
> >>>>>>>> > > >>>>>>>>> to
> >>>>>>>> > > >>>>>>>>>>> hurt performance.
> >>>>>>>> > > >>>>>>>>>>>
> >>>>>>>> > > >>>>>>>>>>> 1b) Yes, I agree that we should avoid adding
> >>>>>>>> overhead to the
> >>>>>>>> > > >>>>> critical
> >>>>>>>> > > >>>>>>>>> path
> >>>>>>>> > > >>>>>>>>>>> by all means. This is sometimes a tradeoff,
> running
> >>>>>>>> blindly
> >>>>>>>> > > >> without
> >>>>>>>> > > >>>>>>> any
> >>>>>>>> > > >>>>>>>>>>> metric gives best performance, but sometimes might
> >>>>>>>> be
> >>>>>>>> > > frustrating
> >>>>>>>> > > >>>>> when
> >>>>>>>> > > >>>>>>>>> we
> >>>>>>>> > > >>>>>>>>>>> debug some issues.
> >>>>>>>> > > >>>>>>>>>>>
> >>>>>>>> > > >>>>>>>>>>> 2. The metrics are indeed very useful. Are they
> >>>>>>>> supposed to
> >>>>>>>> > be
> >>>>>>>> > > >>>>>>> reported
> >>>>>>>> > > >>>>>>>>> by
> >>>>>>>> > > >>>>>>>>>>> the connectors or Flink itself? At this point
> >>>>>>>> FLIP-33 is more
> >>>>>>>> > > >>>>> focused
> >>>>>>>> > > >>>>>>> on
> >>>>>>>> > > >>>>>>>>>>> provide a guidance to the connector authors on the
> >>>>>>>> metrics
> >>>>>>>> > > >>>>> reporting.
> >>>>>>>> > > >>>>>>>>> That
> >>>>>>>> > > >>>>>>>>>>> said, after FLIP-27, I think we should absolutely
> >>>>>>>> report
> >>>>>>>> > these
> >>>>>>>> > > >>>>> metrics
> >>>>>>>> > > >>>>>>>>> in
> >>>>>>>> > > >>>>>>>>>>> the abstract implementation. In any case, the
> metric
> >>>>>>>> > convention
> >>>>>>>> > > >> in
> >>>>>>>> > > >>>>>>> this
> >>>>>>>> > > >>>>>>>>>>> list are expected to evolve over time.
> >>>>>>>> > > >>>>>>>>>>>
> >>>>>>>> > > >>>>>>>>>>> Thanks,
> >>>>>>>> > > >>>>>>>>>>>
> >>>>>>>> > > >>>>>>>>>>> Jiangjie (Becket) Qin
> >>>>>>>> > > >>>>>>>>>>>
> >>>>>>>> > > >>>>>>>>>>> On Tue, May 28, 2019 at 6:24 PM Piotr Nowojski <
> >>>>>>>> > > >>>>> pi...@ververica.com>
> >>>>>>>> > > >>>>>>>>>>> wrote:
> >>>>>>>> > > >>>>>>>>>>>
> >>>>>>>> > > >>>>>>>>>>>> Hi,
> >>>>>>>> > > >>>>>>>>>>>>
> >>>>>>>> > > >>>>>>>>>>>> Thanks for the proposal and driving the effort
> >>>>>>>> here Becket
> >>>>>>>> > :)
> >>>>>>>> > > >> I’ve
> >>>>>>>> > > >>>>>>> read
> >>>>>>>> > > >>>>>>>>>>>> through the FLIP-33 [1], and here are couple of
> my
> >>>>>>>> thoughts.
> >>>>>>>> > > >>>>>>>>>>>>
> >>>>>>>> > > >>>>>>>>>>>> Big +1 for standardising the metric names between
> >>>>>>>> > connectors,
> >>>>>>>> > > it
> >>>>>>>> > > >>>>> will
> >>>>>>>> > > >>>>>>>>>>>> definitely help us and users a lot.
> >>>>>>>> > > >>>>>>>>>>>>
> >>>>>>>> > > >>>>>>>>>>>> Issues/questions/things to discuss that I’ve
> >>>>>>>> thought of:
> >>>>>>>> > > >>>>>>>>>>>>
> >>>>>>>> > > >>>>>>>>>>>> 1a. If we are about to duplicate some metrics,
> can
> >>>>>>>> this
> >>>>>>>> > > become a
> >>>>>>>> > > >>>>>>>>>>>> performance issue?
> >>>>>>>> > > >>>>>>>>>>>> 1b. Generally speaking, we should make sure that
> >>>>>>>> collecting
> >>>>>>>> > > >> those
> >>>>>>>> > > >>>>>>>>> metrics
> >>>>>>>> > > >>>>>>>>>>>> is as non intrusive as possible, especially that
> >>>>>>>> they will
> >>>>>>>> > > need
> >>>>>>>> > > >>>>> to be
> >>>>>>>> > > >>>>>>>>>>>> updated once per record. (They might be collected
> >>>>>>>> more
> >>>>>>>> > rarely
> >>>>>>>> > > >> with
> >>>>>>>> > > >>>>>>> some
> >>>>>>>> > > >>>>>>>>>>>> overhead, but the hot path of updating it per
> >>>>>>>> record will
> >>>>>>>> > need
> >>>>>>>> > > >> to
> >>>>>>>> > > >>>>> be
> >>>>>>>> > > >>>>>>> as
> >>>>>>>> > > >>>>>>>>>>>> quick as possible). That includes both avoiding
> >>>>>>>> heavy
> >>>>>>>> > > >> computation
> >>>>>>>> > > >>>>> on
> >>>>>>>> > > >>>>>>>>> per
> >>>>>>>> > > >>>>>>>>>>>> record path: histograms?, measuring time for time
> >>>>>>>> based
> >>>>>>>> > > metrics
> >>>>>>>> > > >>>>> (per
> >>>>>>>> > > >>>>>>>>>>>> second) (System.currentTimeMillis() depending on
> >>>>>>>> the
> >>>>>>>> > > >>>>> implementation
> >>>>>>>> > > >>>>>>> can
> >>>>>>>> > > >>>>>>>>>>>> invoke a system call)
> >>>>>>>> > > >>>>>>>>>>>>
> >>>>>>>> > > >>>>>>>>>>>> 2. It would be nice to have metrics, that allow
> us
> >>>>>>>> to check
> >>>>>>>> > > the
> >>>>>>>> > > >>>>> cause
> >>>>>>>> > > >>>>>>>>> of
> >>>>>>>> > > >>>>>>>>>>>> back pressure:
> >>>>>>>> > > >>>>>>>>>>>> a) for sources, length of input queue (in bytes?
> >>>>>>>> Or boolean
> >>>>>>>> > > >>>>>>>>>>>> hasSomethingl/isEmpty)
> >>>>>>>> > > >>>>>>>>>>>> b) for sinks, length of output queue (in bytes?
> Or
> >>>>>>>> boolean
> >>>>>>>> > > >>>>>>>>>>>> hasSomething/isEmpty)
> >>>>>>>> > > >>>>>>>>>>>>
> >>>>>>>> > > >>>>>>>>>>>> a) is useful in a scenario when we are processing
> >>>>>>>> backlog of
> >>>>>>>> > > >>>>> records,
> >>>>>>>> > > >>>>>>>>> all
> >>>>>>>> > > >>>>>>>>>>>> of the internal Flink’s input/output network
> >>>>>>>> buffers are
> >>>>>>>> > > empty,
> >>>>>>>> > > >>>>> and
> >>>>>>>> > > >>>>>>> we
> >>>>>>>> > > >>>>>>>>> want
> >>>>>>>> > > >>>>>>>>>>>> to check whether the external source system is
> the
> >>>>>>>> > bottleneck
> >>>>>>>> > > >>>>>>> (source’s
> >>>>>>>> > > >>>>>>>>>>>> input queue will be empty), or if the Flink’s
> >>>>>>>> connector is
> >>>>>>>> > the
> >>>>>>>> > > >>>>>>>>> bottleneck
> >>>>>>>> > > >>>>>>>>>>>> (source’s input queues will be full).
> >>>>>>>> > > >>>>>>>>>>>> b) similar story. Backlog of records, but this
> >>>>>>>> time all of
> >>>>>>>> > the
> >>>>>>>> > > >>>>>>> internal
> >>>>>>>> > > >>>>>>>>>>>> Flink’s input/ouput network buffers are full, and
> >>>>>>>> we want to
> >>>>>>>> > > >> check
> >>>>>>>> > > >>>>>>>>> whether
> >>>>>>>> > > >>>>>>>>>>>> the external sink system is the bottleneck (sink
> >>>>>>>> output
> >>>>>>>> > queues
> >>>>>>>> > > >> are
> >>>>>>>> > > >>>>>>>>> full),
> >>>>>>>> > > >>>>>>>>>>>> or if the Flink’s connector is the bottleneck
> >>>>>>>> (sink’s output
> >>>>>>>> > > >>>>> queues
> >>>>>>>> > > >>>>>>>>> will be
> >>>>>>>> > > >>>>>>>>>>>> empty)
> >>>>>>>> > > >>>>>>>>>>>>
> >>>>>>>> > > >>>>>>>>>>>> It might be sometimes difficult to provide those
> >>>>>>>> metrics, so
> >>>>>>>> > > >> they
> >>>>>>>> > > >>>>>>> could
> >>>>>>>> > > >>>>>>>>>>>> be optional, but if we could provide them, it
> >>>>>>>> would be
> >>>>>>>> > really
> >>>>>>>> > > >>>>>>> helpful.
> >>>>>>>> > > >>>>>>>>>>>>
> >>>>>>>> > > >>>>>>>>>>>> Piotrek
> >>>>>>>> > > >>>>>>>>>>>>
> >>>>>>>> > > >>>>>>>>>>>> [1]
> >>>>>>>> > > >>>>>>>>>>>>
> >>>>>>>> > > >>>>>>>>>
> >>>>>>>> > > >>>>>>>
> >>>>>>>> > > >>>>>
> >>>>>>>> > > >>
> >>>>>>>> > >
> >>>>>>>> >
> >>>>>>>>
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-33:+Standardize+Connector+Metrics
> >>>>>>>> > > >>>>>>>>>>>> <
> >>>>>>>> > > >>>>>>>>>>>>
> >>>>>>>> > > >>>>>>>>>
> >>>>>>>> > > >>>>>>>
> >>>>>>>> > > >>>>>
> >>>>>>>> > > >>
> >>>>>>>> > >
> >>>>>>>> >
> >>>>>>>>
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-33:+Standardize+Connector+Metrics
> >>>>>>>> > > >>>>>>>>>>>>>
> >>>>>>>> > > >>>>>>>>>>>>
> >>>>>>>> > > >>>>>>>>>>>>> On 24 Apr 2019, at 13:28, Stephan Ewen <
> >>>>>>>> se...@apache.org>
> >>>>>>>> > > >> wrote:
> >>>>>>>> > > >>>>>>>>>>>>>
> >>>>>>>> > > >>>>>>>>>>>>> I think this sounds reasonable.
> >>>>>>>> > > >>>>>>>>>>>>>
> >>>>>>>> > > >>>>>>>>>>>>> Let's keep the "reconfiguration without stopping the job" out of
> >>>>>>>> > > >>>>>>>>>>>>> this, because that would be a super big effort, and if we approach
> >>>>>>>> > > >>>>>>>>>>>>> that, it should be in a more generic way rather than specific to
> >>>>>>>> > > >>>>>>>>>>>>> connector metrics.
> >>>>>>>> > > >>>>>>>>>>>>>
> >>>>>>>> > > >>>>>>>>>>>>> I would suggest looking at the following things before starting
> >>>>>>>> > > >>>>>>>>>>>>> with any implementation work:
> >>>>>>>> > > >>>>>>>>>>>>>
> >>>>>>>> > > >>>>>>>>>>>>> - Try and find a committer to support this, otherwise it will be
> >>>>>>>> > > >>>>>>>>>>>>> hard to make progress
> >>>>>>>> > > >>>>>>>>>>>>> - Start with defining a smaller set of "core metrics" and extend
> >>>>>>>> > > >>>>>>>>>>>>> the set later. I think that is easier than blocking now on reaching
> >>>>>>>> > > >>>>>>>>>>>>> consensus on a large group of metrics.
> >>>>>>>> > > >>>>>>>>>>>>> - Find a solution to the problem Chesnay mentioned, that the
> >>>>>>>> > > >>>>>>>>>>>>> "records in" metric is somehow overloaded and already exists in the
> >>>>>>>> > > >>>>>>>>>>>>> IO metric group.
> >>>>>>>> > > >>>>>>>>>>>>>
> >>>>>>>> > > >>>>>>>>>>>>>
> >>>>>>>> > > >>>>>>>>>>>>> On Mon, Mar 25, 2019 at 7:16 AM Becket Qin <
> >>>>>>>> > > >> becket....@gmail.com
> >>>>>>>> > > >>>>>>
> >>>>>>>> > > >>>>>>>>>>>> wrote:
> >>>>>>>> > > >>>>>>>>>>>>>
> >>>>>>>> > > >>>>>>>>>>>>>> Hi Stephan,
> >>>>>>>> > > >>>>>>>>>>>>>>
> >>>>>>>> > > >>>>>>>>>>>>>> Thanks a lot for the feedback. All makes sense.
> >>>>>>>> > > >>>>>>>>>>>>>>
> >>>>>>>> > > >>>>>>>>>>>>>> It is a good suggestion to simply have an
> >>>>>>>> > onRecord(numBytes,
> >>>>>>>> > > >>>>>>>>> eventTime)
> >>>>>>>> > > >>>>>>>>>>>>>> method for connector writers. It should meet
> >>>>>>>> most of the
> >>>>>>>> > > >>>>>>>>> requirements,
> >>>>>>>> > > >>>>>>>>>>>>>> individual
> >>>>>>>> > > >>>>>>>>>>>>>>
> >>>>>>>> > > >>>>>>>>>>>>>> The configurable metrics feature is something
> >>>>>>>> really
> >>>>>>>> > useful,
> >>>>>>>> > > >>>>>>>>>>>> especially if
> >>>>>>>> > > >>>>>>>>>>>>>> we can somehow make it dynamically configurable
> >>>>>>>> without
> >>>>>>>> > > >> stopping
> >>>>>>>> > > >>>>>>> the
> >>>>>>>> > > >>>>>>>>>>>> jobs.
> >>>>>>>> > > >>>>>>>>>>>>>> It might be better to make it a separate
> >>>>>>>> discussion
> >>>>>>>> > because
> >>>>>>>> > > it
> >>>>>>>> > > >>>>> is a
> >>>>>>>> > > >>>>>>>>>>>> more
> >>>>>>>> > > >>>>>>>>>>>>>> generic feature instead of only for connectors.
> >>>>>>>> > > >>>>>>>>>>>>>>
> >>>>>>>> > > >>>>>>>>>>>>>> So in order to make some progress, in this FLIP
> >>>>>>>> we can
> >>>>>>>> > limit
> >>>>>>>> > > >> the
> >>>>>>>> > > >>>>>>>>>>>> discussion
> >>>>>>>> > > >>>>>>>>>>>>>> scope to the connector related items:
> >>>>>>>> > > >>>>>>>>>>>>>>
> >>>>>>>> > > >>>>>>>>>>>>>> - the standard connector metric names and
> types.
> >>>>>>>> > > >>>>>>>>>>>>>> - the abstract ConnectorMetricHandler interface
> >>>>>>>> > > >>>>>>>>>>>>>>
> >>>>>>>> > > >>>>>>>>>>>>>> I'll start a separate thread to discuss other
> >>>>>>>> general
> >>>>>>>> > metric
> >>>>>>>> > > >>>>>>> related
> >>>>>>>> > > >>>>>>>>>>>>>> enhancement items including:
> >>>>>>>> > > >>>>>>>>>>>>>>
> >>>>>>>> > > >>>>>>>>>>>>>> - optional metrics
> >>>>>>>> > > >>>>>>>>>>>>>> - dynamic metric configuration
> >>>>>>>> > > >>>>>>>>>>>>>> - potential combination with rate limiter
> >>>>>>>> > > >>>>>>>>>>>>>>
> >>>>>>>> > > >>>>>>>>>>>>>> Does this plan sound reasonable?
> >>>>>>>> > > >>>>>>>>>>>>>>
> >>>>>>>> > > >>>>>>>>>>>>>> Thanks,
> >>>>>>>> > > >>>>>>>>>>>>>>
> >>>>>>>> > > >>>>>>>>>>>>>> Jiangjie (Becket) Qin
> >>>>>>>> > > >>>>>>>>>>>>>>
> >>>>>>>> > > >>>>>>>>>>>>>> On Sat, Mar 23, 2019 at 5:53 AM Stephan Ewen <
> >>>>>>>> > > >> se...@apache.org>
> >>>>>>>> > > >>>>>>>>> wrote:
> >>>>>>>> > > >>>>>>>>>>>>>>
> >>>>>>>> > > >>>>>>>>>>>>>>> Ignoring for a moment implementation details,
> >>>>>>>> this
> >>>>>>>> > > connector
> >>>>>>>> > > >>>>>>> metrics
> >>>>>>>> > > >>>>>>>>>>>> work
> >>>>>>>> > > >>>>>>>>>>>>>>> is a really good thing to do, in my opinion
> >>>>>>>> > > >>>>>>>>>>>>>>>
> >>>>>>>> > > >>>>>>>>>>>>>>> The question "oh, my job seems to be doing nothing, I am looking at
> >>>>>>>> > > >>>>>>>>>>>>>>> the UI and the 'records in' value is still zero" is in the top three
> >>>>>>>> > > >>>>>>>>>>>>>>> support questions I have been asked personally.
> >>>>>>>> > > >>>>>>>>>>>>>>> Introspection into "how far is the consumer lagging behind" (event
> >>>>>>>> > > >>>>>>>>>>>>>>> time fetch latency) came up many times as well.
> >>>>>>>> > > >>>>>>>>>>>>>>>
> >>>>>>>> > > >>>>>>>>>>>>>>> So big +1 to solving this problem.
> >>>>>>>> > > >>>>>>>>>>>>>>>
> >>>>>>>> > > >>>>>>>>>>>>>>> About the exact design - I would try to go for
> >>>>>>>> the
> >>>>>>>> > > following
> >>>>>>>> > > >>>>>>>>>>>> properties:
> >>>>>>>> > > >>>>>>>>>>>>>>>
> >>>>>>>> > > >>>>>>>>>>>>>>> - keep complexity out of connectors. Ideally the metrics handler has
> >>>>>>>> > > >>>>>>>>>>>>>>> a single onRecord(numBytes, eventTime) method or so, and everything
> >>>>>>>> > > >>>>>>>>>>>>>>> else is internal to the handler. That makes it dead simple for the
> >>>>>>>> > > >>>>>>>>>>>>>>> connector. We can also think of an extensive scheme for connector
> >>>>>>>> > > >>>>>>>>>>>>>>> specific metrics.
> >>>>>>>> > > >>>>>>>>>>>>>>>
> >>>>>>>> > > >>>>>>>>>>>>>>> - make it configurable on the job or cluster level which metrics the
> >>>>>>>> > > >>>>>>>>>>>>>>> handler internally creates when that method is invoked.
> >>>>>>>> > > >>>>>>>>>>>>>>>
> >>>>>>>> > > >>>>>>>>>>>>>>> What do you think?
> >>>>>>>> > > >>>>>>>>>>>>>>>
> >>>>>>>> > > >>>>>>>>>>>>>>> Best,
> >>>>>>>> > > >>>>>>>>>>>>>>> Stephan
> >>>>>>>> > > >>>>>>>>>>>>>>>
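A rough sketch of the single-entry-point handler described above, assuming Flink's Counter and Gauge interfaces. The class name, the metric names, and the idea of deriving the lag from the wall clock are illustrative only; this is not the API that was eventually adopted.

    import org.apache.flink.metrics.Counter;
    import org.apache.flink.metrics.Gauge;
    import org.apache.flink.metrics.MetricGroup;

    // Hypothetical handler: the connector calls a single method per record and all
    // metric bookkeeping stays inside the handler.
    public class ConnectorMetricHandler {

        private final Counter numRecordsIn;
        private final Counter numBytesIn;
        private volatile long lastEventTimeLag;

        public ConnectorMetricHandler(MetricGroup connectorGroup) {
            this.numRecordsIn = connectorGroup.counter("numRecordsIn");
            this.numBytesIn = connectorGroup.counter("numBytesIn");
            connectorGroup.gauge("currentEmitEventTimeLag", (Gauge<Long>) () -> lastEventTimeLag);
        }

        /** The only method a connector has to call when it emits a record. */
        public void onRecord(long numBytes, long eventTime) {
            numRecordsIn.inc();
            numBytesIn.inc(numBytes);
            lastEventTimeLag = System.currentTimeMillis() - eventTime;
        }
    }

Which metrics the constructor actually registers could then be driven by job- or cluster-level configuration, as suggested in the second bullet above.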
> >>>>>>>> > > >>>>>>>>>>>>>>>
> >>>>>>>> > > >>>>>>>>>>>>>>> On Thu, Mar 21, 2019 at 10:42 AM Chesnay
> >>>>>>>> Schepler <
> >>>>>>>> > > >>>>>>>>> ches...@apache.org
> >>>>>>>> > > >>>>>>>>>>>>>
> >>>>>>>> > > >>>>>>>>>>>>>>> wrote:
> >>>>>>>> > > >>>>>>>>>>>>>>>
> >>>>>>>> > > >>>>>>>>>>>>>>>> As I said before, I believe this to be
> >>>>>>>> over-engineered
> >>>>>>>> > and
> >>>>>>>> > > >>>>> have
> >>>>>>>> > > >>>>>>> no
> >>>>>>>> > > >>>>>>>>>>>>>>>> interest in this implementation.
> >>>>>>>> > > >>>>>>>>>>>>>>>>
> >>>>>>>> > > >>>>>>>>>>>>>>>> There are conceptual issues like defining a
> >>>>>>>> duplicate
> >>>>>>>> > > >>>>>>>>>>>>>> numBytesIn(PerSec)
> >>>>>>>> > > >>>>>>>>>>>>>>>> metric that already exists for each operator.
> >>>>>>>> > > >>>>>>>>>>>>>>>>
> >>>>>>>> > > >>>>>>>>>>>>>>>> On 21.03.2019 06:13, Becket Qin wrote:
> >>>>>>>> > > >>>>>>>>>>>>>>>>> A few updates to the thread. I uploaded a
> >>>>>>>> patch[1] as a
> >>>>>>>> > > >>>>> complete
> >>>>>>>> > > >>>>>>>>>>>>>>>>> example of how users can use the metrics in
> >>>>>>>> the future.
> >>>>>>>> > > >>>>>>>>>>>>>>>>>
> >>>>>>>> > > >>>>>>>>>>>>>>>>> Some thoughts below after taking a look at
> the
> >>>>>>>> > > >>>>>>> AbstractMetricGroup
> >>>>>>>> > > >>>>>>>>>>>>>> and
> >>>>>>>> > > >>>>>>>>>>>>>>>>> its subclasses.
> >>>>>>>> > > >>>>>>>>>>>>>>>>>
> >>>>>>>> > > >>>>>>>>>>>>>>>>> This patch intends to provide convenience for Flink connector
> >>>>>>>> > > >>>>>>>>>>>>>>>>> implementations to follow the metrics standards proposed in FLIP-33.
> >>>>>>>> > > >>>>>>>>>>>>>>>>> It also tries to enhance the metric management in a general way to
> >>>>>>>> > > >>>>>>>>>>>>>>>>> help users with:
> >>>>>>>> > > >>>>>>>>>>>>>>>>>
> >>>>>>>> > > >>>>>>>>>>>>>>>>> 1. metric definition
> >>>>>>>> > > >>>>>>>>>>>>>>>>> 2. metric dependencies check
> >>>>>>>> > > >>>>>>>>>>>>>>>>> 3. metric validation
> >>>>>>>> > > >>>>>>>>>>>>>>>>> 4. metric control (turn on / off particular
> >>>>>>>> metrics)
> >>>>>>>> > > >>>>>>>>>>>>>>>>>
> >>>>>>>> > > >>>>>>>>>>>>>>>>> This patch wraps |MetricGroup| to extend the functionality of
> >>>>>>>> > > >>>>>>>>>>>>>>>>> |AbstractMetricGroup| and its subclasses. The |AbstractMetricGroup|
> >>>>>>>> > > >>>>>>>>>>>>>>>>> mainly focuses on the metric group hierarchy, but does not really
> >>>>>>>> > > >>>>>>>>>>>>>>>>> manage the metrics other than keeping them in a Map.
> >>>>>>>> > > >>>>>>>>>>>>>>>>>
> >>>>>>>> > > >>>>>>>>>>>>>>>>> Ideally we should only have one entry point
> >>>>>>>> for the
> >>>>>>>> > > >> metrics.
> >>>>>>>> > > >>>>>>>>>>>>>>>>>
> >>>>>>>> > > >>>>>>>>>>>>>>>>> Right now the entry point is
> >>>>>>>> |AbstractMetricGroup|.
> >>>>>>>> > > >> However,
> >>>>>>>> > > >>>>>>>>> besides
> >>>>>>>> > > >>>>>>>>>>>>>>>>> the missing functionality mentioned above,
> >>>>>>>> > > >>>>> |AbstractMetricGroup|
> >>>>>>>> > > >>>>>>>>>>>>>> seems
> >>>>>>>> > > >>>>>>>>>>>>>>>>> deeply rooted in Flink runtime. We could
> >>>>>>>> extract it out
> >>>>>>>> > > to
> >>>>>>>> > > >>>>>>>>>>>>>>>>> flink-metrics in order to use it for generic
> >>>>>>>> purpose.
> >>>>>>>> > > There
> >>>>>>>> > > >>>>> will
> >>>>>>>> > > >>>>>>>>> be
> >>>>>>>> > > >>>>>>>>>>>>>>>>> some work, though.
> >>>>>>>> > > >>>>>>>>>>>>>>>>>
> >>>>>>>> > > >>>>>>>>>>>>>>>>> Another approach is to make |AbstractMetrics| in this patch the
> >>>>>>>> > > >>>>>>>>>>>>>>>>> metric entry point. It wraps the metric group and provides the
> >>>>>>>> > > >>>>>>>>>>>>>>>>> missing functionalities. Then we can roll out this pattern to
> >>>>>>>> > > >>>>>>>>>>>>>>>>> runtime components gradually as well.
> >>>>>>>> > > >>>>>>>>>>>>>>>>>
> >>>>>>>> > > >>>>>>>>>>>>>>>>> My first thought is that the latter approach gives a smoother
> >>>>>>>> > > >>>>>>>>>>>>>>>>> migration. But I am also OK with doing a refactoring on the
> >>>>>>>> > > >>>>>>>>>>>>>>>>> |AbstractMetricGroup| family.
> >>>>>>>> > > >>>>>>>>>>>>>>>>>
> >>>>>>>> > > >>>>>>>>>>>>>>>>>
> >>>>>>>> > > >>>>>>>>>>>>>>>>> Thanks,
> >>>>>>>> > > >>>>>>>>>>>>>>>>>
> >>>>>>>> > > >>>>>>>>>>>>>>>>> Jiangjie (Becket) Qin
> >>>>>>>> > > >>>>>>>>>>>>>>>>>
> >>>>>>>> > > >>>>>>>>>>>>>>>>> [1]
> https://github.com/becketqin/flink/pull/1
> >>>>>>>> > > >>>>>>>>>>>>>>>>>
> >>>>>>>> > > >>>>>>>>>>>>>>>>> On Mon, Feb 25, 2019 at 2:32 PM Becket Qin <
> >>>>>>>> > > >>>>>>> becket....@gmail.com
> >>>>>>>> > > >>>>>>>>>>>>>>>>> <mailto:becket....@gmail.com>> wrote:
> >>>>>>>> > > >>>>>>>>>>>>>>>>>
> >>>>>>>> > > >>>>>>>>>>>>>>>>> Hi Chesnay,
> >>>>>>>> > > >>>>>>>>>>>>>>>>>
> >>>>>>>> > > >>>>>>>>>>>>>>>>> It might be easier to discuss some
> >>>>>>>> implementation
> >>>>>>>> > details
> >>>>>>>> > > >> in
> >>>>>>>> > > >>>>>>> the
> >>>>>>>> > > >>>>>>>>>>>>>>>>> PR review instead of in the FLIP discussion
> >>>>>>>> thread. I
> >>>>>>>> > > have
> >>>>>>>> > > >> a
> >>>>>>>> > > >>>>>>>>>>>>>> patch
> >>>>>>>> > > >>>>>>>>>>>>>>>>> for Kafka connectors ready but haven't
> >>>>>>>> submitted the PR
> >>>>>>>> > > >> yet.
> >>>>>>>> > > >>>>>>>>>>>>>>>>> Hopefully that will help explain a bit more.
> >>>>>>>> > > >>>>>>>>>>>>>>>>>
> >>>>>>>> > > >>>>>>>>>>>>>>>>> ** Re: metric type binding
> >>>>>>>> > > >>>>>>>>>>>>>>>>> This is a valid point that is worth discussing. If I understand
> >>>>>>>> > > >>>>>>>>>>>>>>>>> correctly, there are two points:
> >>>>>>>> > > >>>>>>>>>>>>>>>>>
> >>>>>>>> > > >>>>>>>>>>>>>>>>> 1. Metric type / interface does not matter as long as the metric
> >>>>>>>> > > >>>>>>>>>>>>>>>>> semantic is clearly defined.
> >>>>>>>> > > >>>>>>>>>>>>>>>>> Conceptually speaking, I agree that as long as the metric semantic
> >>>>>>>> > > >>>>>>>>>>>>>>>>> is defined, the metric type does not matter. To some extent, Gauge /
> >>>>>>>> > > >>>>>>>>>>>>>>>>> Counter / Meter / Histogram themselves can be thought of as some
> >>>>>>>> > > >>>>>>>>>>>>>>>>> well-recognized semantics, if you wish. In Flink, these metric
> >>>>>>>> > > >>>>>>>>>>>>>>>>> semantics have their associated interface classes. In practice, such
> >>>>>>>> > > >>>>>>>>>>>>>>>>> semantic-to-interface binding seems necessary for different
> >>>>>>>> > > >>>>>>>>>>>>>>>>> components to communicate. Simply standardizing the semantics of the
> >>>>>>>> > > >>>>>>>>>>>>>>>>> connector metrics seems not sufficient for people to build an
> >>>>>>>> > > >>>>>>>>>>>>>>>>> ecosystem on top of. At the end of the day, we still need to have
> >>>>>>>> > > >>>>>>>>>>>>>>>>> some embodiment of the metric semantics that people can program
> >>>>>>>> > > >>>>>>>>>>>>>>>>> against.
> >>>>>>>> > > >>>>>>>>>>>>>>>>>
> >>>>>>>> > > >>>>>>>>>>>>>>>>> 2. Sometimes the same metric semantic can be exposed using different
> >>>>>>>> > > >>>>>>>>>>>>>>>>> metric types / interfaces.
> >>>>>>>> > > >>>>>>>>>>>>>>>>> This is a good point. Counter and Gauge-as-a-Counter are pretty much
> >>>>>>>> > > >>>>>>>>>>>>>>>>> interchangeable. This is more of a trade-off between the user
> >>>>>>>> > > >>>>>>>>>>>>>>>>> experience of metric producers and consumers. The metric producers
> >>>>>>>> > > >>>>>>>>>>>>>>>>> want to use Counter or Gauge depending on whether the counter is
> >>>>>>>> > > >>>>>>>>>>>>>>>>> already tracked in code, while ideally the metric consumers only
> >>>>>>>> > > >>>>>>>>>>>>>>>>> want to see a single metric type for each metric. I am leaning
> >>>>>>>> > > >>>>>>>>>>>>>>>>> towards making the metric producers happy, i.e. allow the Gauge /
> >>>>>>>> > > >>>>>>>>>>>>>>>>> Counter metric types, and let the metric consumers handle the type
> >>>>>>>> > > >>>>>>>>>>>>>>>>> variation. The reason is that in practice, there might be more
> >>>>>>>> > > >>>>>>>>>>>>>>>>> connector implementations than metric reporter implementations. We
> >>>>>>>> > > >>>>>>>>>>>>>>>>> could also provide some helper method to facilitate reading from
> >>>>>>>> > > >>>>>>>>>>>>>>>>> such a variable metric type.
> >>>>>>>> > > >>>>>>>>>>>>>>>>>
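As a sketch of the kind of helper hinted at here, a consumer-side utility could normalize the two representations. This is a hypothetical helper written for illustration, not an existing Flink API.

    import org.apache.flink.metrics.Counter;
    import org.apache.flink.metrics.Gauge;
    import org.apache.flink.metrics.Metric;

    // Hypothetical consumer-side helper: read a count regardless of whether the
    // producer registered a Counter or a numeric Gauge.
    public final class MetricReaders {

        private MetricReaders() {}

        public static long readCount(Metric metric) {
            if (metric instanceof Counter) {
                return ((Counter) metric).getCount();
            }
            if (metric instanceof Gauge) {
                Object value = ((Gauge<?>) metric).getValue();
                if (value instanceof Number) {
                    return ((Number) value).longValue();
                }
            }
            throw new IllegalArgumentException("Metric is neither a Counter nor a numeric Gauge");
        }
    }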
> >>>>>>>> > > >>>>>>>>>>>>>>>>>
> >>>>>>>> > > >>>>>>>>>>>>>>>>> Just some quick replies to the comments
> around
> >>>>>>>> > > >> implementation
> >>>>>>>> > > >>>>>>>>>>>>>>>> details.
> >>>>>>>> > > >>>>>>>>>>>>>>>>>
> >>>>>>>> > > >>>>>>>>>>>>>>>>>   4) single place where metrics are
> >>>>>>>> registered except
> >>>>>>>> > > >>>>>>>>>>>>>>>>>   connector-specific
> >>>>>>>> > > >>>>>>>>>>>>>>>>>   ones (which we can't really avoid).
> >>>>>>>> > > >>>>>>>>>>>>>>>>>
> >>>>>>>> > > >>>>>>>>>>>>>>>>> Registering connector-specific ones in a single place is actually
> >>>>>>>> > > >>>>>>>>>>>>>>>>> something that I want to achieve.
> >>>>>>>> > > >>>>>>>>>>>>>>>>>
> >>>>>>>> > > >>>>>>>>>>>>>>>>>   2) I'm talking about time-series databases
> >>>>>>>> like
> >>>>>>>> > > >>>>> Prometheus.
> >>>>>>>> > > >>>>>>>>>>>>>> We
> >>>>>>>> > > >>>>>>>>>>>>>>>>>   would
> >>>>>>>> > > >>>>>>>>>>>>>>>>>   only have a gauge metric exposing the last
> >>>>>>>> > > >>>>>>>>> fetchTime/emitTime
> >>>>>>>> > > >>>>>>>>>>>>>>>>>   that is
> >>>>>>>> > > >>>>>>>>>>>>>>>>>   regularly reported to the backend
> >>>>>>>> (Prometheus),
> >>>>>>>> > where a
> >>>>>>>> > > >>>>>>> user
> >>>>>>>> > > >>>>>>>>>>>>>>>>>   could build
> >>>>>>>> > > >>>>>>>>>>>>>>>>>   a histogram of his choosing when/if he
> >>>>>>>> wants it.
> >>>>>>>> > > >>>>>>>>>>>>>>>>>
> >>>>>>>> > > >>>>>>>>>>>>>>>>> Not sure if such downsampling works. As an example, a user might
> >>>>>>>> > > >>>>>>>>>>>>>>>>> complain that there are some intermittent latency spikes (maybe a
> >>>>>>>> > > >>>>>>>>>>>>>>>>> few records in 10 seconds) in their processing system. Having a
> >>>>>>>> > > >>>>>>>>>>>>>>>>> Gauge sampling instantaneous latency seems unlikely to be useful.
> >>>>>>>> > > >>>>>>>>>>>>>>>>> However, looking at the actual 99.9th percentile latency might help.
> >>>>>>>> > > >>>>>>>>>>>>>>>>>
> >>>>>>>> > > >>>>>>>>>>>>>>>>> Thanks,
> >>>>>>>> > > >>>>>>>>>>>>>>>>>
> >>>>>>>> > > >>>>>>>>>>>>>>>>> Jiangjie (Becket) Qin
> >>>>>>>> > > >>>>>>>>>>>>>>>>>
> >>>>>>>> > > >>>>>>>>>>>>>>>>>
> >>>>>>>> > > >>>>>>>>>>>>>>>>> On Fri, Feb 22, 2019 at 9:30 PM Chesnay
> >>>>>>>> Schepler
> >>>>>>>> > > >>>>>>>>>>>>>>>>> <ches...@apache.org <mailto:
> >>>>>>>> ches...@apache.org>>
> >>>>>>>> > wrote:
> >>>>>>>> > > >>>>>>>>>>>>>>>>>
> >>>>>>>> > > >>>>>>>>>>>>>>>>>   Re: over complication of implementation.
> >>>>>>>> > > >>>>>>>>>>>>>>>>>
> >>>>>>>> > > >>>>>>>>>>>>>>>>>   I think I understand better now what you're shooting for,
> >>>>>>>> > > >>>>>>>>>>>>>>>>>   effectively something like the OperatorIOMetricGroup.
> >>>>>>>> > > >>>>>>>>>>>>>>>>>   But still, re-define
> >>>>>>>> setupConnectorMetrics() to
> >>>>>>>> > accept
> >>>>>>>> > > a
> >>>>>>>> > > >>>>>>> set
> >>>>>>>> > > >>>>>>>>>>>>>>>>>   of flags
> >>>>>>>> > > >>>>>>>>>>>>>>>>>   for counters/meters(ans _possibly_
> >>>>>>>> histograms) along
> >>>>>>>> > > >>>>> with a
> >>>>>>>> > > >>>>>>>>>>>>>>>>>   set of
> >>>>>>>> > > >>>>>>>>>>>>>>>>>   well-defined Optional<Gauge<?>>, and
> return
> >>>>>>>> the
> >>>>>>>> > group.
> >>>>>>>> > > >>>>>>>>>>>>>>>>>
> >>>>>>>> > > >>>>>>>>>>>>>>>>>   Solves all issues as far as I can tell:
> >>>>>>>> > > >>>>>>>>>>>>>>>>>   1) no metrics must be created manually (except Gauges, which are
> >>>>>>>> > > >>>>>>>>>>>>>>>>>   effectively just Suppliers and you can't get around this),
> >>>>>>>> > > >>>>>>>>>>>>>>>>>   2) additional metrics can be registered on the returned group,
> >>>>>>>> > > >>>>>>>>>>>>>>>>>   3) see 1),
> >>>>>>>> > > >>>>>>>>>>>>>>>>>   4) single place where metrics are registered except
> >>>>>>>> > > >>>>>>>>>>>>>>>>>   connector-specific ones (which we can't really avoid).
> >>>>>>>> > > >>>>>>>>>>>>>>>>>
> >>>>>>>> > > >>>>>>>>>>>>>>>>>   Re: Histogram
> >>>>>>>> > > >>>>>>>>>>>>>>>>>
> >>>>>>>> > > >>>>>>>>>>>>>>>>>   1) As an example, whether "numRecordsIn"
> is
> >>>>>>>> exposed
> >>>>>>>> > as
> >>>>>>>> > > a
> >>>>>>>> > > >>>>>>>>>>>>>>>>>   Counter or a
> >>>>>>>> > > >>>>>>>>>>>>>>>>>   Gauge should be irrelevant. So far we're
> >>>>>>>> using the
> >>>>>>>> > > >> metric
> >>>>>>>> > > >>>>>>>>>>>>>> type
> >>>>>>>> > > >>>>>>>>>>>>>>>>>   that is
> >>>>>>>> > > >>>>>>>>>>>>>>>>>   the most convenient at exposing a given
> >>>>>>>> value. If
> >>>>>>>> > there
> >>>>>>>> > > >>>>> is
> >>>>>>>> > > >>>>>>>>>>>>>>>>>   some backing
> >>>>>>>> > > >>>>>>>>>>>>>>>>>   data-structure that we want to expose some
> >>>>>>>> data from
> >>>>>>>> > we
> >>>>>>>> > > >>>>>>>>>>>>>>>>>   typically opt
> >>>>>>>> > > >>>>>>>>>>>>>>>>>   for a Gauge, as otherwise we're just
> >>>>>>>> mucking around
> >>>>>>>> > > with
> >>>>>>>> > > >>>>>>> the
> >>>>>>>> > > >>>>>>>>>>>>>>>>>   Meter/Counter API to get it to match.
> >>>>>>>> Similarly, if
> >>>>>>>> > we
> >>>>>>>> > > >>>>> want
> >>>>>>>> > > >>>>>>>>>>>>>> to
> >>>>>>>> > > >>>>>>>>>>>>>>>>>   count
> >>>>>>>> > > >>>>>>>>>>>>>>>>>   something but no current count exists we
> >>>>>>>> typically
> >>>>>>>> > > added
> >>>>>>>> > > >>>>> a
> >>>>>>>> > > >>>>>>>>>>>>>>>>>   Counter.
> >>>>>>>> > > >>>>>>>>>>>>>>>>>   That's why attaching semantics to metric
> >>>>>>>> types makes
> >>>>>>>> > > >>>>> little
> >>>>>>>> > > >>>>>>>>>>>>>>>>>   sense (but
> >>>>>>>> > > >>>>>>>>>>>>>>>>>   unfortunately several reporters already do
> >>>>>>>> it); for
> >>>>>>>> > > >>>>>>>>>>>>>>>>>   counters/meters
> >>>>>>>> > > >>>>>>>>>>>>>>>>>   certainly, but the majority of metrics are
> >>>>>>>> gauges.
> >>>>>>>> > > >>>>>>>>>>>>>>>>>
> >>>>>>>> > > >>>>>>>>>>>>>>>>>   2) I'm talking about time-series databases
> >>>>>>>> like
> >>>>>>>> > > >>>>> Prometheus.
> >>>>>>>> > > >>>>>>>>>>>>>> We
> >>>>>>>> > > >>>>>>>>>>>>>>>>>   would
> >>>>>>>> > > >>>>>>>>>>>>>>>>>   only have a gauge metric exposing the last
> >>>>>>>> > > >>>>>>>>> fetchTime/emitTime
> >>>>>>>> > > >>>>>>>>>>>>>>>>>   that is
> >>>>>>>> > > >>>>>>>>>>>>>>>>>   regularly reported to the backend
> >>>>>>>> (Prometheus),
> >>>>>>>> > where a
> >>>>>>>> > > >>>>>>> user
> >>>>>>>> > > >>>>>>>>>>>>>>>>>   could build
> >>>>>>>> > > >>>>>>>>>>>>>>>>>   a histogram of his choosing when/if he
> >>>>>>>> wants it.
> >>>>>>>> > > >>>>>>>>>>>>>>>>>
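A minimal sketch of that alternative, assuming Flink's Gauge interface: the connector only tracks the latest fetch-time lag and leaves any percentile math to the time-series backend. Class and metric names are illustrative.

    import org.apache.flink.metrics.Gauge;
    import org.apache.flink.metrics.MetricGroup;

    // Hypothetical reporter: expose only the most recent lag as a gauge that the
    // backend (e.g. Prometheus) scrapes and aggregates into whatever histogram the
    // user wants, instead of computing a histogram inside the application.
    public class FetchLagReporter {

        private volatile long lastFetchTimeLagMillis;

        public FetchLagReporter(MetricGroup group) {
            group.gauge("currentFetchEventTimeLag", (Gauge<Long>) () -> lastFetchTimeLagMillis);
        }

        /** Called by the connector for each fetched record, given the record's fetch time. */
        public void recordFetch(long fetchTimeMillis) {
            lastFetchTimeLagMillis = System.currentTimeMillis() - fetchTimeMillis;
        }
    }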
> >>>>>>>> > > >>>>>>>>>>>>>>>>>   On 22.02.2019 13:57, Becket Qin wrote:
> >>>>>>>> > > >>>>>>>>>>>>>>>>>> Hi Chesnay,
> >>>>>>>> > > >>>>>>>>>>>>>>>>>>
> >>>>>>>> > > >>>>>>>>>>>>>>>>>> Thanks for the explanation.
> >>>>>>>> > > >>>>>>>>>>>>>>>>>>
> >>>>>>>> > > >>>>>>>>>>>>>>>>>> ** Re: FLIP
> >>>>>>>> > > >>>>>>>>>>>>>>>>>> I might have misunderstood this, but it seems that "major changes"
> >>>>>>>> > > >>>>>>>>>>>>>>>>>> are well defined in FLIP. The full contents are the following:
> >>>>>>>> > > >>>>>>>>>>>>>>>>>> What is considered a "major change" that needs a FLIP?
> >>>>>>>> > > >>>>>>>>>>>>>>>>>>
> >>>>>>>> > > >>>>>>>>>>>>>>>>>> Any of the following should be considered a
> >>>>>>>> major
> >>>>>>>> > > change:
> >>>>>>>> > > >>>>>>>>>>>>>>>>>>
> >>>>>>>> > > >>>>>>>>>>>>>>>>>> - Any major new feature, subsystem, or
> piece
> >>>>>>>> of
> >>>>>>>> > > >>>>>>>>>>>>>>>>>   functionality
> >>>>>>>> > > >>>>>>>>>>>>>>>>>> - *Any change that impacts the public
> >>>>>>>> interfaces of
> >>>>>>>> > the
> >>>>>>>> > > >>>>>>>>>>>>>>>>>   project*
> >>>>>>>> > > >>>>>>>>>>>>>>>>>>
> >>>>>>>> > > >>>>>>>>>>>>>>>>>> What are the "public interfaces" of the
> >>>>>>>> project?
> >>>>>>>> > > >>>>>>>>>>>>>>>>>>
> >>>>>>>> > > >>>>>>>>>>>>>>>>>>
> >>>>>>>> > > >>>>>>>>>>>>>>>>>>
> >>>>>>>> > > >>>>>>>>>>>>>>>>>> *All of the following are public interfaces
> >>>>>>>> *that
> >>>>>>>> > people
> >>>>>>>> > > >>>>>>>>>>>>>>>>>   build around:
> >>>>>>>> > > >>>>>>>>>>>>>>>>>>
> >>>>>>>> > > >>>>>>>>>>>>>>>>>> - DataStream and DataSet API, including
> >>>>>>>> classes
> >>>>>>>> > related
> >>>>>>>> > > >>>>>>>>>>>>>>>>>   to that, such as
> >>>>>>>> > > >>>>>>>>>>>>>>>>>> StreamExecutionEnvironment
> >>>>>>>> > > >>>>>>>>>>>>>>>>>>
> >>>>>>>> > > >>>>>>>>>>>>>>>>>>
> >>>>>>>> > > >>>>>>>>>>>>>>>>>> - Classes marked with the @Public
> annotation
> >>>>>>>> > > >>>>>>>>>>>>>>>>>>
> >>>>>>>> > > >>>>>>>>>>>>>>>>>>
> >>>>>>>> > > >>>>>>>>>>>>>>>>>> - On-disk binary formats, such as
> >>>>>>>> > > >>>>>>>>>>>>>> checkpoints/savepoints
> >>>>>>>> > > >>>>>>>>>>>>>>>>>>
> >>>>>>>> > > >>>>>>>>>>>>>>>>>>
> >>>>>>>> > > >>>>>>>>>>>>>>>>>> - User-facing scripts/command-line tools,
> >>>>>>>> i.e.
> >>>>>>>> > > >>>>>>>>>>>>>>>>>   bin/flink, Yarn scripts,
> >>>>>>>> > > >>>>>>>>>>>>>>>>>> Mesos scripts
> >>>>>>>> > > >>>>>>>>>>>>>>>>>>
> >>>>>>>> > > >>>>>>>>>>>>>>>>>>
> >>>>>>>> > > >>>>>>>>>>>>>>>>>> - Configuration settings
> >>>>>>>> > > >>>>>>>>>>>>>>>>>>
> >>>>>>>> > > >>>>>>>>>>>>>>>>>>
> >>>>>>>> > > >>>>>>>>>>>>>>>>>> - *Exposed monitoring information*
> >>>>>>>> > > >>>>>>>>>>>>>>>>>>
> >>>>>>>> > > >>>>>>>>>>>>>>>>>>
> >>>>>>>> > > >>>>>>>>>>>>>>>>>> So any monitoring information change is considered a public
> >>>>>>>> > > >>>>>>>>>>>>>>>>>> interface change, and any public interface change is considered a
> >>>>>>>> > > >>>>>>>>>>>>>>>>>> "major change".
> >>>>>>>> > > >>>>>>>>>>>>>>>>>>
> >>>>>>>> > > >>>>>>>>>>>>>>>>>>
> >>>>>>>> > > >>>>>>>>>>>>>>>>>> ** Re: over complication of implementation.
> >>>>>>>> > > >>>>>>>>>>>>>>>>>>
> >>>>>>>> > > >>>>>>>>>>>>>>>>>> Although this is more of an implementation detail that is not
> >>>>>>>> > > >>>>>>>>>>>>>>>>>> covered by the FLIP, it may be worth discussing.
> >>>>>>>> > > >>>>>>>>>>>>>>>>>>
> >>>>>>>> > > >>>>>>>>>>>>>>>>>> First of all, I completely agree that we
> >>>>>>>> should use
> >>>>>>>> > the
> >>>>>>>> > > >>>>>>>>>>>>>>>>>   simplest way to
> >>>>>>>> > > >>>>>>>>>>>>>>>>>> achieve our goal. To me the goal is the
> >>>>>>>> following:
> >>>>>>>> > > >>>>>>>>>>>>>>>>>> 1. Clear connector conventions and
> >>>>>>>> interfaces.
> >>>>>>>> > > >>>>>>>>>>>>>>>>>> 2. The easiness of creating a connector.
> >>>>>>>> > > >>>>>>>>>>>>>>>>>>
> >>>>>>>> > > >>>>>>>>>>>>>>>>>> Both of them are important to the
> prosperity
> >>>>>>>> of the
> >>>>>>>> > > >>>>>>>>>>>>>>>>>   connector ecosystem. So
> >>>>>>>> > > >>>>>>>>>>>>>>>>>> I'd rather abstract as much as possible on
> >>>>>>>> our side to
> >>>>>>>> > > >> make
> >>>>>>>> > > >>>>>>>>>>>>>>>>>   the connector
> >>>>>>>> > > >>>>>>>>>>>>>>>>>> developer's work lighter. Given this goal,
> a
> >>>>>>>> static
> >>>>>>>> > util
> >>>>>>>> > > >>>>>>>>>>>>>>>>>   method approach
> >>>>>>>> > > >>>>>>>>>>>>>>>>>> might have a few drawbacks:
> >>>>>>>> > > >>>>>>>>>>>>>>>>>> 1. Users still have to construct the metrics by themselves. (And
> >>>>>>>> > > >>>>>>>>>>>>>>>>>> note that this might be erroneous by itself. For example, a custom
> >>>>>>>> > > >>>>>>>>>>>>>>>>>> wrapper around a Dropwizard meter may be used instead of MeterView.)
> >>>>>>>> > > >>>>>>>>>>>>>>>>>> 2. When connector specific metrics are added, it is difficult to
> >>>>>>>> > > >>>>>>>>>>>>>>>>>> enforce the scope to be the same as for the standard metrics.
> >>>>>>>> > > >>>>>>>>>>>>>>>>>> 3. It seems that a method proliferation is inevitable if we want to
> >>>>>>>> > > >>>>>>>>>>>>>>>>>> apply sanity checks, e.g. the numBytesIn metric was not registered
> >>>>>>>> > > >>>>>>>>>>>>>>>>>> for a meter.
> >>>>>>>> > > >>>>>>>>>>>>>>>>>> 4. Metrics are still defined in random places and are hard to track.
> >>>>>>>> > > >>>>>>>>>>>>>>>>>>
> >>>>>>>> > > >>>>>>>>>>>>>>>>>> The current PR I had was inspired by the Config system in Kafka,
> >>>>>>>> > > >>>>>>>>>>>>>>>>>> which I found pretty handy. In fact it is not only used by Kafka
> >>>>>>>> > > >>>>>>>>>>>>>>>>>> itself but even by some other projects that depend on Kafka. I am
> >>>>>>>> > > >>>>>>>>>>>>>>>>>> not saying this approach is perfect, but I think it is worth saving
> >>>>>>>> > > >>>>>>>>>>>>>>>>>> the work for connector writers and encouraging a more systematic
> >>>>>>>> > > >>>>>>>>>>>>>>>>>> implementation. That being said, I am fully open to suggestions.
> >>>>>>>> > > >>>>>>>>>>>>>>>>>>
> >>>>>>>> > > >>>>>>>>>>>>>>>>>>
> >>>>>>>> > > >>>>>>>>>>>>>>>>>> Re: Histogram
> >>>>>>>> > > >>>>>>>>>>>>>>>>>> I think there are two orthogonal questions
> >>>>>>>> around
> >>>>>>>> > those
> >>>>>>>> > > >>>>>>>>>>>>>>>> metrics:
> >>>>>>>> > > >>>>>>>>>>>>>>>>>>
> >>>>>>>> > > >>>>>>>>>>>>>>>>>> 1. Regardless of the metric type, by just looking at the meaning of
> >>>>>>>> > > >>>>>>>>>>>>>>>>>> a metric, is it generic to all connectors? If the answer is yes, we
> >>>>>>>> > > >>>>>>>>>>>>>>>>>> should include the metric in the convention. No matter whether we
> >>>>>>>> > > >>>>>>>>>>>>>>>>>> include it in the convention or not, some connector implementations
> >>>>>>>> > > >>>>>>>>>>>>>>>>>> will emit such a metric. It is better to have a convention than
> >>>>>>>> > > >>>>>>>>>>>>>>>>>> letting each connector do random things.
> >>>>>>>> > > >>>>>>>>>>>>>>>>>>
> >>>>>>>> > > >>>>>>>>>>>>>>>>>> 2. If a standard metric is a histogram, what should we do?
> >>>>>>>> > > >>>>>>>>>>>>>>>>>> I agree that we should make it clear that using histograms will have
> >>>>>>>> > > >>>>>>>>>>>>>>>>>> a performance risk. But I do see that histograms are useful in some
> >>>>>>>> > > >>>>>>>>>>>>>>>>>> fine-granularity debugging where one does not have the luxury to
> >>>>>>>> > > >>>>>>>>>>>>>>>>>> stop the system and inject more inspection code. So the workaround I
> >>>>>>>> > > >>>>>>>>>>>>>>>>>> am thinking of is to provide some implementation suggestions. Assume
> >>>>>>>> > > >>>>>>>>>>>>>>>>>> later on we have a mechanism of selective metrics. In the abstract
> >>>>>>>> > > >>>>>>>>>>>>>>>>>> metrics class we can disable those metrics by default so individual
> >>>>>>>> > > >>>>>>>>>>>>>>>>>> connector writers do not have to do anything (this is another
> >>>>>>>> > > >>>>>>>>>>>>>>>>>> advantage of having an AbstractMetrics instead of static util
> >>>>>>>> > > >>>>>>>>>>>>>>>>>> methods.)
> >>>>>>>> > > >>>>>>>>>>>>>>>>>>
> >>>>>>>> > > >>>>>>>>>>>>>>>>>> I am not sure I fully understand the histogram-in-the-backend
> >>>>>>>> > > >>>>>>>>>>>>>>>>>> approach. Can you explain a bit more? Do you mean emitting the raw
> >>>>>>>> > > >>>>>>>>>>>>>>>>>> data, e.g. fetchTime and emitTime, with each record and letting the
> >>>>>>>> > > >>>>>>>>>>>>>>>>>> histogram computation happen in the background? Or having the
> >>>>>>>> > > >>>>>>>>>>>>>>>>>> processing thread put the values into a queue and a separate thread
> >>>>>>>> > > >>>>>>>>>>>>>>>>>> poll from the queue and add them into the histogram?
> >>>>>>>> > > >>>>>>>>>>>>>>>>>>
> >>>>>>>> > > >>>>>>>>>>>>>>>>>> Thanks,
> >>>>>>>> > > >>>>>>>>>>>>>>>>>>
> >>>>>>>> > > >>>>>>>>>>>>>>>>>> Jiangjie (Becket) Qin
> >>>>>>>> > > >>>>>>>>>>>>>>>>>>
> >>>>>>>> > > >>>>>>>>>>>>>>>>>>
> >>>>>>>> > > >>>>>>>>>>>>>>>>>>
> >>>>>>>> > > >>>>>>>>>>>>>>>>>>
> >>>>>>>> > > >>>>>>>>>>>>>>>>>>
> >>>>>>>> > > >>>>>>>>>>>>>>>>>> On Fri, Feb 22, 2019 at 4:34 PM Chesnay
> >>>>>>>> Schepler
> >>>>>>>> > > >>>>>>>>>>>>>>>>>   <ches...@apache.org <mailto:
> >>>>>>>> ches...@apache.org>>
> >>>>>>>> > > wrote:
> >>>>>>>> > > >>>>>>>>>>>>>>>>>>
> >>>>>>>> > > >>>>>>>>>>>>>>>>>>> Re: Flip
> >>>>>>>> > > >>>>>>>>>>>>>>>>>>> The very first line under both the main
> >>>>>>>> header and
> >>>>>>>> > > >> Purpose
> >>>>>>>> > > >>>>>>>>>>>>>>>>>   section
> >>>>>>>> > > >>>>>>>>>>>>>>>>>>> describe Flips as "major changes", which
> >>>>>>>> this isn't.
> >>>>>>>> > > >>>>>>>>>>>>>>>>>>>
> >>>>>>>> > > >>>>>>>>>>>>>>>>>>> Re: complication
> >>>>>>>> > > >>>>>>>>>>>>>>>>>>> I'm not arguing against standardization, but again this is an
> >>>>>>>> > > >>>>>>>>>>>>>>>>>>> over-complicated implementation when a static utility method would
> >>>>>>>> > > >>>>>>>>>>>>>>>>>>> be sufficient.
> >>>>>>>> > > >>>>>>>>>>>>>>>>>>>
> >>>>>>>> > > >>>>>>>>>>>>>>>>>>> public static void setupConnectorMetrics(
> >>>>>>>> > > >>>>>>>>>>>>>>>>>>> MetricGroup operatorMetricGroup,
> >>>>>>>> > > >>>>>>>>>>>>>>>>>>> String connectorName,
> >>>>>>>> > > >>>>>>>>>>>>>>>>>>> Optional<Gauge<Long>> numRecordsIn,
> >>>>>>>> > > >>>>>>>>>>>>>>>>>>> ...)
> >>>>>>>> > > >>>>>>>>>>>>>>>>>>>
> >>>>>>>> > > >>>>>>>>>>>>>>>>>>> This gives you all you need:
> >>>>>>>> > > >>>>>>>>>>>>>>>>>>> * a well-defined set of metrics for a
> >>>>>>>> connector to
> >>>>>>>> > > opt-in
> >>>>>>>> > > >>>>>>>>>>>>>>>>>>> * standardized naming schemes for scope
> and
> >>>>>>>> > individual
> >>>>>>>> > > >>>>>>>>>>>>>>> metrics
> >>>>>>>> > > >>>>>>>>>>>>>>>>>>> * standardize metric types (although
> >>>>>>>> personally I'm
> >>>>>>>> > not
> >>>>>>>> > > >>>>>>>>>>>>>>>>>   interested in that
> >>>>>>>> > > >>>>>>>>>>>>>>>>>>> since metric types should be considered
> >>>>>>>> syntactic
> >>>>>>>> > > sugar)
> >>>>>>>> > > >>>>>>>>>>>>>>>>>>>
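A sketch of what such a static utility could look like, returning the group so that connector-specific metrics can be registered on it, per the points above. The class name, the scope layout, and the reduced parameter list are illustrative only, filling in the elided part of the signature sketched earlier with one example metric.

    import java.util.Optional;

    import org.apache.flink.metrics.Gauge;
    import org.apache.flink.metrics.MetricGroup;

    // Hypothetical utility: registers the well-defined connector metrics that the
    // connector opted into, under a standardized scope, and returns the group.
    public final class ConnectorMetricUtils {

        private ConnectorMetricUtils() {}

        public static MetricGroup setupConnectorMetrics(
                MetricGroup operatorMetricGroup,
                String connectorName,
                Optional<Gauge<Long>> numRecordsIn) {

            // Standardized scope: all connector metrics live under the connector's name.
            MetricGroup connectorGroup = operatorMetricGroup.addGroup(connectorName);

            // Only register the metrics this connector actually provides.
            numRecordsIn.ifPresent(gauge -> connectorGroup.gauge("numRecordsIn", gauge));

            // Connector-specific metrics can be registered on the returned group.
            return connectorGroup;
        }
    }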
> >>>>>>>> > > >>>>>>>>>>>>>>>>>>> Re: Configurable Histogram
> >>>>>>>> > > >>>>>>>>>>>>>>>>>>> If anything they _must_ be turned off by
> >>>>>>>> default, but
> >>>>>>>> > > the
> >>>>>>>> > > >>>>>>>>>>>>>>>>>   metric system is
> >>>>>>>> > > >>>>>>>>>>>>>>>>>>> already exposing so many options that I'm
> >>>>>>>> not too
> >>>>>>>> > keen
> >>>>>>>> > > on
> >>>>>>>> > > >>>>>>>>>>>>>>>>>   adding even more.
> >>>>>>>> > > >>>>>>>>>>>>>>>>>>> You have also only addressed my first
> >>>>>>>> argument
> >>>>>>>> > against
> >>>>>>>> > > >>>>>>>>>>>>>>>>>   histograms
> >>>>>>>> > > >>>>>>>>>>>>>>>>>>> (performance), the second one still stands
> >>>>>>>> (calculate
> >>>>>>>> > > >>>>>>>>>>>>>>>>>   histogram in metric
> >>>>>>>> > > >>>>>>>>>>>>>>>>>>> backends instead).
> >>>>>>>> > > >>>>>>>>>>>>>>>>>>>
> >>>>>>>> > > >>>>>>>>>>>>>>>>>>> On 21.02.2019 16:27, Becket Qin wrote:
> >>>>>>>> > > >>>>>>>>>>>>>>>>>>>> Hi Chesnay,
> >>>>>>>> > > >>>>>>>>>>>>>>>>>>>>
> >>>>>>>> > > >>>>>>>>>>>>>>>>>>>> Thanks for the comments. I think this is worthy of a FLIP because
> >>>>>>>> > > >>>>>>>>>>>>>>>>>>>> it is public API. According to the FLIP description, a FLIP is
> >>>>>>>> > > >>>>>>>>>>>>>>>>>>>> required in case of:
> >>>>>>>> > > >>>>>>>>>>>>>>>>>>>> - Any change that impacts the public
> >>>>>>>> interfaces of
> >>>>>>>> > > >>>>>>>>>>>>>>>>>   the project
> >>>>>>>> > > >>>>>>>>>>>>>>>>>>>>
> >>>>>>>> > > >>>>>>>>>>>>>>>>>>>> and the following entry is found in the
> >>>>>>>> definition
> >>>>>>>> > of
> >>>>>>>> > > >>>>>>>>>>>>>>>>>   "public interface".
> >>>>>>>> > > >>>>>>>>>>>>>>>>>>>>
> >>>>>>>> > > >>>>>>>>>>>>>>>>>>>> - Exposed monitoring information
> >>>>>>>> > > >>>>>>>>>>>>>>>>>>>>
> >>>>>>>> > > >>>>>>>>>>>>>>>>>>>> Metrics are critical to any production system, so a clear metric
> >>>>>>>> > > >>>>>>>>>>>>>>>>>>>> definition is important for any serious user. For an organization
> >>>>>>>> > > >>>>>>>>>>>>>>>>>>>> with a large Flink installation, a change in metrics means a great
> >>>>>>>> > > >>>>>>>>>>>>>>>>>>>> amount of work. So such changes do need to be fully discussed and
> >>>>>>>> > > >>>>>>>>>>>>>>>>>>>> documented.
> >>>>>>>> > > >>>>>>>>>>>>>>>>>>>>
> >>>>>>>> > > >>>>>>>>>>>>>>>>>>>> ** Re: Histogram.
> >>>>>>>> > > >>>>>>>>>>>>>>>>>>>> We can discuss whether there is a better way to expose metrics that
> >>>>>>>> > > >>>>>>>>>>>>>>>>>>>> are suitable for histograms. My micro-benchmark on various histogram
> >>>>>>>> > > >>>>>>>>>>>>>>>>>>>> implementations also indicates that they are significantly slower
> >>>>>>>> > > >>>>>>>>>>>>>>>>>>>> than other metric types. But I don't think that means we should
> >>>>>>>> > > >>>>>>>>>>>>>>>>>>>> never use histograms; it means we should use them with caution. For
> >>>>>>>> > > >>>>>>>>>>>>>>>>>>>> example, we can suggest that implementations turn them off by
> >>>>>>>> > > >>>>>>>>>>>>>>>>>>>> default and only turn them on for a short amount of time when
> >>>>>>>> > > >>>>>>>>>>>>>>>>>>>> performing some micro-debugging.
> >>>>>>>> > > >>>>>>>>>>>>>>>>>>>>
> >>>>>>>> > > >>>>>>>>>>>>>>>>>>>> ** Re: complication:
> >>>>>>>> > > >>>>>>>>>>>>>>>>>>>> Connector conventions are essential for the Flink ecosystem. The
> >>>>>>>> > > >>>>>>>>>>>>>>>>>>>> Flink connector pool is probably the most important part of Flink,
> >>>>>>>> > > >>>>>>>>>>>>>>>>>>>> just like in any other data system. Clear connector conventions will
> >>>>>>>> > > >>>>>>>>>>>>>>>>>>>> help build the Flink ecosystem in a more organic way.
> >>>>>>>> > > >>>>>>>>>>>>>>>>>>>> Take the metrics convention as an example: imagine someone has
> >>>>>>>> > > >>>>>>>>>>>>>>>>>>>> developed a Flink connector for System foo, and another developer
> >>>>>>>> > > >>>>>>>>>>>>>>>>>>>> may have developed a monitoring and diagnostic framework for Flink
> >>>>>>>> > > >>>>>>>>>>>>>>>>>>>> which analyzes Flink job performance based on metrics. With a clear
> >>>>>>>> > > >>>>>>>>>>>>>>>>>>>> metric convention, those two projects could be developed
> >>>>>>>> > > >>>>>>>>>>>>>>>>>>>> independently. Once users put them together, they would work without
> >>>>>>>> > > >>>>>>>>>>>>>>>>>>>> additional modifications. This cannot be easily achieved by just
> >>>>>>>> > > >>>>>>>>>>>>>>>>>>>> defining a few constants.
> >>>>>>>> > > >>>>>>>>>>>>>>>>>>>>
> >>>>>>>> > > >>>>>>>>>>>>>>>>>>>> ** Re: selective metrics:
> >>>>>>>> > > >>>>>>>>>>>>>>>>>>>> Sure, we can discuss that in a separate
> >>>>>>>> thread.
> >>>>>>>> > > >>>>>>>>>>>>>>>>>>>>
> >>>>>>>> > > >>>>>>>>>>>>>>>>>>>> @Dawid
> >>>>>>>> > > >>>>>>>>>>>>>>>>>>>>
> >>>>>>>> > > >>>>>>>>>>>>>>>>>>>> ** Re: latency / fetchedLatency
> >>>>>>>> > > >>>>>>>>>>>>>>>>>>>> The primary purpose of establishing such a convention is to help
> >>>>>>>> > > >>>>>>>>>>>>>>>>>>>> developers write connectors in a more compatible way. The convention
> >>>>>>>> > > >>>>>>>>>>>>>>>>>>>> is supposed to be defined more proactively. So when looking at the
> >>>>>>>> > > >>>>>>>>>>>>>>>>>>>> convention, it seems more important to see if the concept is
> >>>>>>>> > > >>>>>>>>>>>>>>>>>>>> applicable to connectors in general.
> >>>>>>>> > > >>>>>>>>>>>>>>>>>>>> It might be true that so far only the Kafka connector reports
> >>>>>>>> > > >>>>>>>>>>>>>>>>>>>> latency. But there might be hundreds of other connector
> >>>>>>>> > > >>>>>>>>>>>>>>>>>>>> implementations in the Flink ecosystem, though not in the Flink
> >>>>>>>> > > >>>>>>>>>>>>>>>>>>>> repo, and some of them also emit latency. I think a lot of other
> >>>>>>>> > > >>>>>>>>>>>>>>>>>>>> sources actually also have an append timestamp, e.g. database
> >>>>>>>> > > >>>>>>>>>>>>>>>>>>>> binlogs and some K-V stores. So I wouldn't be surprised if some
> >>>>>>>> > > >>>>>>>>>>>>>>>>>>>> database connector can also emit latency metrics.
> >>>>>>>> > > >>>>>>>>>>>>>>>>>>>>
> >>>>>>>> > > >>>>>>>>>>>>>>>>>>>> Thanks,
> >>>>>>>> > > >>>>>>>>>>>>>>>>>>>>
> >>>>>>>> > > >>>>>>>>>>>>>>>>>>>> Jiangjie (Becket) Qin
> >>>>>>>> > > >>>>>>>>>>>>>>>>>>>>
> >>>>>>>> > > >>>>>>>>>>>>>>>>>>>>
> >>>>>>>> > > >>>>>>>>>>>>>>>>>>>> On Thu, Feb 21, 2019 at 10:14 PM Chesnay
> >>>>>>>> Schepler
> >>>>>>>> > > >>>>>>>>>>>>>>>>>   <ches...@apache.org <mailto:
> >>>>>>>> ches...@apache.org>>
> >>>>>>>> > > >>>>>>>>>>>>>>>>>>>> wrote:
> >>>>>>>> > > >>>>>>>>>>>>>>>>>>>>
> >>>>>>>> > > >>>>>>>>>>>>>>>>>>>>> Regarding 2) It doesn't make sense to
> >>>>>>>> investigate
> >>>>>>>> > > this
> >>>>>>>> > > >>>>>>>>>>>>>> as
> >>>>>>>> > > >>>>>>>>>>>>>>>>>   part of this
> >>>>>>>> > > >>>>>>>>>>>>>>>>>>>>> FLIP. This is something that could be of
> >>>>>>>> interest
> >>>>>>>> > for
> >>>>>>>> > > >>>>>>>>>>>>>> the
> >>>>>>>> > > >>>>>>>>>>>>>>>>>   entire metric
> >>>>>>>> > > >>>>>>>>>>>>>>>>>>>>> system, and should be designed for as
> >>>>>>>> such.
> >>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>> > > >>>>>>>>>>>>>>>>>>>>> Regarding the proposal as a whole:
> >>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>> > > >>>>>>>>>>>>>>>>>>>>> Histogram metrics shall not be added to the core of Flink. They are
> >>>>>>>> > > >>>>>>>>>>>>>>>>>>>>> significantly more expensive than other metrics, and calculating
> >>>>>>>> > > >>>>>>>>>>>>>>>>>>>>> histograms in the application is regarded as an anti-pattern by
> >>>>>>>> > > >>>>>>>>>>>>>>>>>>>>> several metric backends, which instead recommend exposing the raw
> >>>>>>>> > > >>>>>>>>>>>>>>>>>>>>> data and calculating the histogram in the backend.
> >>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>> > > >>>>>>>>>>>>>>>>>>>>> Second, this seems overly complicated.
> >>>>>>>> Given that
> >>>>>>>> > we
> >>>>>>>> > > >>>>>>>>>>>>>>>>>   already established
> >>>>>>>> > > >>>>>>>>>>>>>>>>>>>>> that not all connectors will export all
> >>>>>>>> metrics we
> >>>>>>>> > > are
> >>>>>>>> > > >>>>>>>>>>>>>>>>>   effectively
> >>>>>>>> > > >>>>>>>>>>>>>>>>>>>>> reducing this down to a consistent
> naming
> >>>>>>>> scheme.
> >>>>>>>> > We
> >>>>>>>> > > >>>>>>>>>>>>>>>>>   don't need anything
> >>>>>>>> > > >>>>>>>>>>>>>>>>>>>>> sophisticated for that; basically just a
> >>>>>>>> few
> >>>>>>>> > > constants
> >>>>>>>> > > >>>>>>>>>>>>>>>>>   that all
> >>>>>>>> > > >>>>>>>>>>>>>>>>>>>>> connectors use.
> >>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>> > > >>>>>>>>>>>>>>>>>>>>> I'm not convinced that this is worthy of
> >>>>>>>> a FLIP.
> >>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>> > > >>>>>>>>>>>>>>>>>>>>> On 21.02.2019 14:26, Dawid Wysakowicz
> >>>>>>>> wrote:
> >>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>> Hi,
> >>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>> Ad 1. In general I understand and I agree. But those particular
> >>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>> metrics (latency, fetchLatency) right now would only be reported if
> >>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>> the user uses the KafkaConsumer with an internal timestampAssigner
> >>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>> and StreamCharacteristic set to EventTime, right? That sounds like
> >>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>> a very specific case. I am not sure if we should introduce a
> >>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>> generic metric that will be disabled/absent for most
> >>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>> implementations.
> >>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>> Ad 2. That sounds like an orthogonal issue that might make sense to
> >>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>> investigate in the future.
> >>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>> Best,
> >>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>> Dawid
> >>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>> On 21/02/2019 13:20, Becket Qin wrote:
> >>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>>> Hi Dawid,
> >>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>>> Thanks for the feedback. That makes
> >>>>>>>> sense to me.
> >>>>>>>> > > >> There
> >>>>>>>> > > >>>>>>>>>>>>>>>>>   are two cases
> >>>>>>>> > > >>>>>>>>>>>>>>>>>>> to
> >>>>>>>> > > >>>>>>>>>>>>>>>>>>>>> be
> >>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>>> addressed.
> >>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>>> 1. The metrics are supposed to be a
> >>>>>>>> guidance. It
> >>>>>>>> > is
> >>>>>>>> > > >>>>>>>>>>>>>>>>>   likely that a
> >>>>>>>> > > >>>>>>>>>>>>>>>>>>>>> connector
> >>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>>> only supports some but not all of the
> >>>>>>>> metrics. In
> >>>>>>>> > > >> that
> >>>>>>>> > > >>>>>>>>>>>>>>>>>   case, each
> >>>>>>>> > > >>>>>>>>>>>>>>>>>>>>> connector
> >>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>>> implementation should have the freedom
> >>>>>>>> to decide
> >>>>>>>> > > >> which
> >>>>>>>> > > >>>>>>>>>>>>>>>>>   metrics are
> >>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>>> reported. For the metrics that are
> >>>>>>>> supported, the
> >>>>>>>> > > >>>>>>>>>>>>>>>>>   guidance should be
> >>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>>> followed.
> >>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>>> 2. Sometimes users may want to disable certain metrics for some
> >>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>>> reason (e.g. performance / reprocessing of data). A generic
> >>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>>> mechanism should be provided to allow users to choose which metrics
> >>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>>> are reported. This mechanism should also be honored by the
> >>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>>> connector implementations.
> >>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>>> Does this sound reasonable to you?
> >>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>>> Thanks,
> >>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>>> Jiangjie (Becket) Qin
> >>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>>> On Thu, Feb 21, 2019 at 4:22 PM Dawid
> >>>>>>>> Wysakowicz
> >>>>>>>> > <
> >>>>>>>> > > >>>>>>>>>>>>>>>>>>>>> dwysakow...@apache.org <mailto:
> >>>>>>>> > > dwysakow...@apache.org
> >>>>>>>> > > >>>>
> >>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>>> wrote:
> >>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>>>> Hi,
> >>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>>>> Generally I like the idea of having a unified, standard set of
> >>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>>>> metrics for all connectors. I have some slight concerns about
> >>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>>>> fetchLatency and latency though. They are computed based on
> >>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>>>> EventTime, which is not a purely technical feature. It often
> >>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>>>> depends on some business logic, and might be absent or defined
> >>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>>>> after the source. Those metrics could also behave in a weird way in
> >>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>>>> case of replaying a backlog. Therefore I am not sure if we should
> >>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>>>> include those metrics by default. Maybe we could at least introduce
> >>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>>>> a feature switch for them? What do you think?
> >>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>>>> Best,
> >>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>>>> Dawid
> >>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>>>> On 21/02/2019 03:13, Becket Qin
> wrote:
> >>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>>>> Bump. If there are no objections to the proposed metrics, I'll
> >>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>>>> start a voting thread later today.
> >>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>>>> Thanks,
> >>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>>>> Jiangjie (Becket) Qin
> >>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>>>> On Mon, Feb 11, 2019 at 8:17 PM
> Becket
> >>>>>>>> Qin
> >>>>>>>> > > >>>>>>>>>>>>>>>>>   <becket....@gmail.com <mailto:
> >>>>>>>> becket....@gmail.com>>
> >>>>>>>> > <
> >>>>>>>> > > >>>>>>>>>>>>>>>>>>>>> becket....@gmail.com <mailto:
> >>>>>>>> becket....@gmail.com
> >>>>>>>> > >>
> >>>>>>>> > > >>>>>>>>>>>>>>> wrote:
> >>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>>>> Hi folks,
> >>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>>>> I would like to start the FLIP discussion thread about
> >>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>>>> standardizing the connector metrics.
> >>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>>>> In short, we would like to provide a convention for Flink connector
> >>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>>>> metrics. It will help simplify monitoring and alerting on Flink
> >>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>>>> jobs.
> >>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>>>> The FLIP link is the following:
> >>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>> > > >>>>>>>>>>>>>>>>>>>
> >>>>>>>> > > >>>>>>>>>>>>>>>>>
> >>>>>>>> > > >>>>>>>>>>>>>>>>
> >>>>>>>> > > >>>>>>>>>>>>>>>
> >>>>>>>> > > >>>>>>>>>>>>>>
> >>>>>>>> > > >>>>>>>>>>>>
> >>>>>>>> > > >>>>>>>>>
> >>>>>>>> > > >>>>>>>
> >>>>>>>> > > >>>>>
> >>>>>>>> > > >>
> >>>>>>>> > >
> >>>>>>>> >
> >>>>>>>>
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-33%3A+Standardize+Connector+Metrics
> >>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>>>> Thanks,
> >>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>>>> Jiangjie (Becket) Qin
> >>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>> > > >>>>>>>>>>>>>>>>>>>
> >>>>>>>> > > >>>>>>>>>>>>>>>>>
> >>>>>>>> > > >>>>>>>>>>>>>>>>
> >>>>>>>> > > >>>>>>>>>>>>>>>>
> >>>>>>>> > > >>>>>>>>>>>>>>>
> >>>>>>>> > > >>>>>>>>>>>>>>
> >>>>>>>> > > >>>>>>>>>>>>
> >>>>>>>> > > >>>>>>>>>>>>
> >>>>>>>> > > >>>>>>>>>
> >>>>>>>> > > >>>>>>>>>
> >>>>>>>> > > >>>>>>>>>
> >>>>>>>> > > >>>>>>>
> >>>>>>>> > > >>>>>>>
> >>>>>>>> > > >>>>>>>
> >>>>>>>> > > >>>>>
> >>>>>>>> > > >>>>>
> >>>>>>>> > > >>
> >>>>>>>> > > >>
> >>>>>>>> > >
> >>>>>>>> > >
> >>>>>>>> >
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> --
> >>>>>>>>
> >>>>>>>> Konstantin Knauf
> >>>>>>>>
> >>>>>>>> https://twitter.com/snntrable
> >>>>>>>>
> >>>>>>>> https://github.com/knaufk
> >>>>>>>>
> >>>>>>>
>


-- 

Konstantin Knauf

https://twitter.com/snntrable

https://github.com/knaufk
