Having the watermark lag metric was the important part from my side.

So +1 to go ahead.

On Fri, Sep 18, 2020 at 4:11 PM Becket Qin <becket....@gmail.com> wrote:

> Hey Stephan,
>
> Thanks for the quick reply. I actually forgot to mention that I also added
> a "watermarkLag" metric to the Source metrics in addition to the
> "currentFetchEventTimeLag" and "currentEmitEventTimeLag".
> So in the current proposal, both XXEventTimeLag will be based on event
> timestamps, while the watermarkLag is based on the watermark. And all the lags
> are defined against the wall clock time. Does this cover what you are
> proposing?
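>
> For reference, a rough sketch of how the three lags relate in the current
> proposal (the variable names below are mine, just for illustration):
>
>   // all values in milliseconds, all measured against the wall clock
>   long currentFetchEventTimeLag = fetchTimeMs - eventTimestampMs;
>   long currentEmitEventTimeLag  = emitTimeMs - eventTimestampMs;
>   long watermarkLag = System.currentTimeMillis() - currentWatermarkMs;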
>
> Thanks,
>
> Jiangjie (Becket) Qin
>
> On Fri, Sep 18, 2020 at 6:08 PM Stephan Ewen <step...@ververica.com>
> wrote:
>
>> Hi Becket!
>>
>> I am wondering if it makes sense to do the following small change:
>>
>>   - Have "currentFetchEventTimeLag" be defined on event timestamps
>> (optionally, if the source system exposes it)  <== this is like in your
>> proposal
>>       this helps understand how long the records were in the source before
>>
>>   - BUT change "currentEmitEventTimeLag" to be
>> "currentSourceWatermarkLag" instead.
>>     That way, users can see how far the source's progress is behind
>> wall-clock time, overall.
>>     It is also a better-defined metric, as it does not oscillate with
>> out-of-order events.
>>
>> What do you think?
>>
>> Best,
>> Stephan
>>
>>
>>
>> On Fri, Sep 18, 2020 at 12:02 PM Becket Qin <becket....@gmail.com> wrote:
>>
>>> Hi folks,
>>>
>>> Thanks for all the great feedback. I have just updated FLIP-33 wiki with
>>> the following changes:
>>>
>>> 1. Renaming: "currentFetchLatency" to "currentFetchEventTimeLag" and
>>> "currentLatency" to "currentEmitEventTimeLag".
>>> 2. Added the public interface code change required for the new metrics.
>>> 3. Added description of whether a metric is predefined or optional, and
>>> which component is expected to update the metric.
>>>
>>> Please let me know if you have any questions. I'll start a vote in two
>>> days if there are no further concerns.
>>>
>>> Thanks,
>>>
>>> Jiangjie (Becket) Qin
>>>
>>> On Wed, Sep 9, 2020 at 9:56 AM Becket Qin <becket....@gmail.com> wrote:
>>>
>>>> Hi Stephan,
>>>>
>>>> Thanks for the input. Just a few more clarifications / questions.
>>>>
>>>> *Num Bytes / Records Metrics*
>>>>
>>>> 1. At this point, the *numRecordsIn(Rate)* metrics exist in both
>>>> OperatorIOMetricGroup and TaskIOMetricGroup. I did not find the
>>>> *numRecordsIn(Rate)* metrics in the TaskIOMetricGroup being updated
>>>> anywhere other than in unit tests. Am I missing something?
>>>>
>>>> 2. *numBytesIn(Rate)* metrics only exist in the TaskIOMetricGroup. At this
>>>> point, the SourceReaders only have access to a SourceReaderContext which
>>>> provides an OperatorMetricGroup. So it seems that connector developers
>>>> are not able to update *numBytesIn(Rate)*. With the multiple
>>>> Source chaining support, it is possible that multiple Sources are
>>>> in the same task. So it looks like we need to add *numBytesIn(Rate)*
>>>> to the operator metrics as well.
>>>>
>>>>
>>>> *Current (Fetch) Latency*
>>>>
>>>> *currentFetchLatency* helps clearly tell whether the latency is caused
>>>> by Flink or not. Backpressure is not the only reason that we see fetch
>>>> latency. Even if there is no back pressure, the records may have passed a
>>>> long pipeline before they entered Flink. For example, say the
>>>> *currentLatency* is 10 seconds and there is no backpressure. Does that mean the record
>>>> spent 10 seconds in the Source operator? If not, how much did Flink
>>>> contribute to that 10 seconds of latency? These questions are frequently
>>>> asked and hard to tell without the fetch latency.
>>>>
>>>> For "currentFetchLatency", we would need to understand timestamps
>>>>> before the records are decoded. That is only possible for some sources,
>>>>> where the client gives us the records in a (partially) decoded form
>>>>> already
>>>>> (like Kafka). Then, some work has been done between the fetch time and the
>>>>> time we update the metric already, so it is already a bit closer to the
>>>>> "currentFetchLatency". I think following this train of thought, there is
>>>>> diminished benefit from that specific metric.
>>>>
>>>>
>>>> We may not have to report the fetch latency before records are decoded.
>>>> One solution is to remember the *FetchTime* when the encoded records
>>>> are fetched, and report the fetch latency after the records are decoded by
>>>> computing (*FetchTime - EventTime*). An approximate implementation
>>>> would be adding a *FetchTime* field to the *RecordsWithSplitIds*, assuming
>>>> that all the records in that data structure are fetched at the same time.
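>>>>
>>>> A minimal sketch of what that could look like (the class and method names
>>>> are mine, not part of the FLIP; it assumes the fetcher stamps each batch
>>>> once and the reader updates a gauge per decoded record):
>>>>
>>>>   import org.apache.flink.metrics.Gauge;
>>>>   import org.apache.flink.metrics.MetricGroup;
>>>>
>>>>   final class FetchLagReporter {
>>>>       private volatile long lastFetchEventTimeLag;
>>>>
>>>>       FetchLagReporter(MetricGroup group) {
>>>>           group.gauge("currentFetchEventTimeLag",
>>>>                   (Gauge<Long>) () -> lastFetchEventTimeLag);
>>>>       }
>>>>
>>>>       // fetchTimeMs was stamped once when the whole batch was fetched
>>>>       void onRecordDecoded(long fetchTimeMs, long eventTimestampMs) {
>>>>           lastFetchEventTimeLag = fetchTimeMs - eventTimestampMs;
>>>>       }
>>>>   }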
>>>>
>>>> Thoughts?
>>>>
>>>> Thanks,
>>>>
>>>> Jiangjie (Becket) Qin
>>>>
>>>> On Wed, Sep 9, 2020 at 12:42 AM Stephan Ewen <step...@ververica.com>
>>>> wrote:
>>>>
>>>>> Thanks for reviving this, Becket!
>>>>>
>>>>> I think Konstantin's comments are great. I'd add these points:
>>>>>
>>>>> *Num Bytes / Records Metrics*
>>>>>
>>>>> For "numBytesIn" and "numRecordsIn", we should reuse the
>>>>> OperatorIOMetric group, then it also gets reported to the overview page in
>>>>> the Web UI.
>>>>>
>>>>> The "numBytesInPerSecond" and "numRecordsInPerSecond" are
>>>>> automatically derived metrics, no need to do anything once we populate the
>>>>> above two metrics
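>>>>>
>>>>> For illustration, a sketch of the wiring (assuming "metrics" is the
>>>>> operator's MetricGroup; Counter and MeterView are Flink's existing
>>>>> org.apache.flink.metrics classes, with MeterView deriving the rate):
>>>>>
>>>>>   Counter numRecordsIn = metrics.counter("numRecordsIn");
>>>>>   metrics.meter("numRecordsInPerSecond", new MeterView(numRecordsIn));
>>>>>   // the hot path only touches the counter:
>>>>>   numRecordsIn.inc();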
>>>>>
>>>>>
>>>>> *Current (Fetch) Latency*
>>>>>
>>>>> I would really go for "eventTimeLag" rather than "fetchLatency". I
>>>>> think "eventTimeLag" is a term that has some adoption in the Flink
>>>>> community and beyond.
>>>>>
>>>>> I am not so sure that I see the benefit of having both "currentLatency"
>>>>> and "currentFetchLatency" (or event time lag before/after), as the two
>>>>> only differ by the time it takes to emit a batch.
>>>>>      - In a non-backpressured case, these should be virtually
>>>>> identical (and both dominated by watermark lag, not the actual time it
>>>>> takes the fetch to be emitted)
>>>>>      - In a backpressured case, why do you care about when data was
>>>>> fetched, as opposed to emitted? Emitted time is relevant for application
>>>>> semantics and checkpoints. Fetch time seems to be an implementation detail
>>>>> (how much does the source buffer).
>>>>>
>>>>> The "currentLatency" (eventTimeLagAfter) can be computed
>>>>> out-of-the-box, independent of a source implementation, so that is also a
>>>>> good argument to make it the main metric.
>>>>> We know timestamps and watermarks in the source, except for cases where
>>>>> no watermarks have been defined at all (batch jobs or pure processing
>>>>> time jobs), in which case this metric should probably be "Infinite".
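>>>>>
>>>>> A sketch of that out-of-the-box computation (the names are mine, and
>>>>> Long.MAX_VALUE stands in for "Infinite"):
>>>>>
>>>>>   long currentSourceWatermarkLag = hasWatermarks
>>>>>           ? System.currentTimeMillis() - currentWatermarkMs
>>>>>           : Long.MAX_VALUE;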
>>>>>
>>>>> For "currentFetchLatency", we would need to understand timestamps
>>>>> before the records are decoded. That is only possible for some sources,
>>>>> where the client gives us the records in a (partially) decoded form
>>>>> already
>>>>> (like Kafka). Then, some work has been done between the fetch time and the
>>>>> time we update the metric already, so it is already a bit closer to the
>>>>> "currentFetchLatency". I think following this train of thought, there is
>>>>> diminished benefit from that specific metric.
>>>>>
>>>>>
>>>>> *Idle Time*
>>>>>
>>>>> I agree, it would be great to rename this. Maybe to "sourceWaitTime"
>>>>> or "sourceIdleTime", to make clear that this is not exactly the time that
>>>>> Flink's processing pipeline is idle, but the time during which the source
>>>>> does not have new data.
>>>>>
>>>>> This is not an easy metric to collect, though (except maybe for the
>>>>> sources that are only idle while they have no split assigned, like
>>>>> continuous file source).
>>>>>
>>>>> *Source Specific Metrics*
>>>>>
>>>>> I believe source-specific would only be "sourceIdleTime",
>>>>> "numRecordsInErrors", "pendingBytes", and "pendingRecords".
>>>>>
>>>>>
>>>>> *Conclusion*
>>>>>
>>>>> We can probably add "numBytesIn", "numRecordsIn", and "eventTimeLag"
>>>>> right away, with little complexity.
>>>>> I'd suggest starting with these.
>>>>>
>>>>> Best,
>>>>> Stephan
>>>>>
>>>>>
>>>>> On Tue, Sep 8, 2020 at 3:25 PM Becket Qin <becket....@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Hey Konstantin,
>>>>>>
>>>>>> Thanks for the feedback and suggestions. Please see the reply below.
>>>>>>
>>>>>> * idleTime: In the meantime, a similar metric "idleTimeMsPerSecond"
>>>>>>> has
>>>>>>> been introduced in https://issues.apache.org/jira/browse/FLINK-16864.
>>>>>>> They
>>>>>>> have a similar name, but different definitions of idleness,
>>>>>>> e.g. "idleTimeMsPerSecond" considers the SourceTask idle when it is
>>>>>>> backpressured. Can we make it clearer that these two metrics mean
>>>>>>> different
>>>>>>> things?
>>>>>>
>>>>>>
>>>>>> That is a good point. I did not notice this metric earlier. It seems
>>>>>> that both metrics are useful to the users. One tells them how busy the
>>>>>> source is and how much more throughput the source can handle. The other
>>>>>> tells the users how long it has been since the source last saw a record, which
>>>>>> is useful for debugging. I'll update the FLIP to make it clear.
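>>>>>>
>>>>>> To make the second flavor concrete, a sketch of a "time since the last
>>>>>> record" gauge (the variable and metric names are illustrative only):
>>>>>>
>>>>>>   private volatile long lastRecordTimeMs = System.currentTimeMillis();
>>>>>>
>>>>>>   metricGroup.gauge("idleTime",
>>>>>>           (Gauge<Long>) () -> System.currentTimeMillis() - lastRecordTimeMs);
>>>>>>   // per record: lastRecordTimeMs = System.currentTimeMillis();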
>>>>>>
>>>>>>   * "current(Fetch)Latency" I am wondering if
>>>>>>> "eventTimeLag(Before|After)"
>>>>>>> is more descriptive/clear. What do others think?
>>>>>>
>>>>>>
>>>>>> I am quite open to the ideas on these names. In fact I also feel
>>>>>> the "current(Fetch)Latency" names are not super intuitive. So it would
>>>>>> be great if we can have better names.
>>>>>>
>>>>>>   * Current(Fetch)Latency implies that the timestamps are directly
>>>>>>> extracted in the source connector, right? Will this be the default
>>>>>>> for
>>>>>>> FLIP-27 sources anyway?
>>>>>>
>>>>>>
>>>>>> The "currentFetchLatency" will probably be reported by each source
>>>>>> implementation, because the data fetching is done by SplitReaders and 
>>>>>> there
>>>>>> is no base implementation. The "currentLatency", on the other hand, can 
>>>>>> be
>>>>>> reported by the SourceReader base implementation. However, since 
>>>>>> developers
>>>>>> can actually implement their own source connector without using our base
>>>>>> implementation, this metric guidance is still useful for connector
>>>>>> developers in that case.
>>>>>>
>>>>>> * Does FLIP-33 also include the implementation of these metrics (to
>>>>>>> the
>>>>>>> extent possible) for all connectors currently available in Apache
>>>>>>> Flink or
>>>>>>> is the "per-connector implementation" out-of-scope?
>>>>>>
>>>>>>
>>>>>> FLIP-33 itself does not specify any implementation of those metrics.
>>>>>> But the connectors we provide in Apache Flink will follow the guidance of
>>>>>> FLIP-33 to emit those metrics when applicable. Maybe we can have some
>>>>>> public static Strings defined for the metric names to help other 
>>>>>> connector
>>>>>> developers follow the same guidance. I can also add that to the public
>>>>>> interface section of the FLIP if we decide to do that.
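>>>>>>
>>>>>> Something along these lines, purely as a sketch (the class and constant
>>>>>> names are illustrative, nothing is decided yet):
>>>>>>
>>>>>>   public final class ConnectorMetricNames {
>>>>>>       public static final String NUM_RECORDS_IN = "numRecordsIn";
>>>>>>       public static final String NUM_BYTES_IN = "numBytesIn";
>>>>>>       public static final String CURRENT_FETCH_EVENT_TIME_LAG =
>>>>>>               "currentFetchEventTimeLag";
>>>>>>       public static final String CURRENT_EMIT_EVENT_TIME_LAG =
>>>>>>               "currentEmitEventTimeLag";
>>>>>>
>>>>>>       private ConnectorMetricNames() {}
>>>>>>   }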
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> Jiangjie (Becket) Qin
>>>>>>
>>>>>> On Tue, Sep 8, 2020 at 9:02 PM Becket Qin <becket....@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Tue, Sep 8, 2020 at 6:55 PM Konstantin Knauf <kna...@apache.org>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi Becket,
>>>>>>>>
>>>>>>>> Thank you for picking up this FLIP. I have a few questions:
>>>>>>>>
>>>>>>>> * two thoughts on naming:
>>>>>>>>    * idleTime: In the meantime, a similar metric
>>>>>>>> "idleTimeMsPerSecond" has
>>>>>>>> been introduced in
>>>>>>>> https://issues.apache.org/jira/browse/FLINK-16864. They
>>>>>>>> have a similar name, but different definitions of idleness,
>>>>>>>> e.g. "idleTimeMsPerSecond" considers the SourceTask idle when it is
>>>>>>>> backpressured. Can we make it clearer that these two metrics mean
>>>>>>>> different
>>>>>>>> things?
>>>>>>>>
>>>>>>>>   * "current(Fetch)Latency" I am wondering if
>>>>>>>> "eventTimeLag(Before|After)"
>>>>>>>> is more descriptive/clear. What do others think?
>>>>>>>>
>>>>>>>>   * Current(Fetch)Latency implies that the timestamps are directly
>>>>>>>> extracted in the source connector, right? Will this be the default
>>>>>>>> for
>>>>>>>> FLIP-27 sources anyway?
>>>>>>>>
>>>>>>>> * Does FLIP-33 also include the implementation of these metrics (to
>>>>>>>> the
>>>>>>>> extent possible) for all connectors currently available in Apache
>>>>>>>> Flink or
>>>>>>>> is the "per-connector implementation" out-of-scope?
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>>
>>>>>>>> Konstantin
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Fri, Sep 4, 2020 at 4:56 PM Becket Qin <becket....@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>> > Hi all,
>>>>>>>> >
>>>>>>>> > To complete the Source refactoring work, I'd like to revive this
>>>>>>>> > discussion. Since the mail thread has been dormant for more than
>>>>>>>> a year,
>>>>>>>> > just to recap the motivation of the FLIP:
>>>>>>>> >
>>>>>>>> > 1. The FLIP proposes to standardize the connector metrics by
>>>>>>>> giving
>>>>>>>> > guidance on the metric specifications, including the name, type
>>>>>>>> and meaning
>>>>>>>> > of the metrics.
>>>>>>>> > 2. It is OK for a connector to not emit some of the metrics in
>>>>>>>> the metric
>>>>>>>> > guidance, but if a metric of the same semantic is emitted, the
>>>>>>>> > implementation should follow the guidance.
>>>>>>>> > 3. It is OK for a connector to emit more metrics than what are
>>>>>>>> listed in
>>>>>>>> > the FLIP. This includes having an alias for a metric specified in
>>>>>>>> the FLIP.
>>>>>>>> > 4. We will implement some of the metrics out of the box in the
>>>>>>>> default
>>>>>>>> > implementation of FLIP-27, as long as it is applicable.
>>>>>>>> >
>>>>>>>> > The FLIP wiki is here:
>>>>>>>> > https://cwiki.apache.org/confluence/display/FLINK/FLIP-33%3A+Standardize+Connector+Metrics
>>>>>>>> >
>>>>>>>> > Any thoughts?
>>>>>>>> >
>>>>>>>> > Thanks,
>>>>>>>> >
>>>>>>>> > Jiangjie (Becket) Qin
>>>>>>>> >
>>>>>>>> >
>>>>>>>> > On Fri, Jun 14, 2019 at 2:29 PM Piotr Nowojski <
>>>>>>>> pi...@ververica.com>
>>>>>>>> > wrote:
>>>>>>>> >
>>>>>>>> > > > we will need to revisit the convention list and adjust them
>>>>>>>> accordingly
>>>>>>>> > > when FLIP-27 is ready
>>>>>>>> > >
>>>>>>>> > >
>>>>>>>> > > Yes, this sounds good :)
>>>>>>>> > >
>>>>>>>> > > Piotrek
>>>>>>>> > >
>>>>>>>> > > > On 13 Jun 2019, at 13:35, Becket Qin <becket....@gmail.com>
>>>>>>>> wrote:
>>>>>>>> > > >
>>>>>>>> > > > Hi Piotr,
>>>>>>>> > > >
>>>>>>>> > > > That's great to know. Chances are that we will need to
>>>>>>>> revisit the
>>>>>>>> > > > convention list and adjust them accordingly when FLIP-27 is
>>>>>>>> ready. At
>>>>>>>> > > that
>>>>>>>> > > > point we can mark some of the metrics as available by default
>>>>>>>> for
>>>>>>>> > > > connectors implementing the new interface.
>>>>>>>> > > >
>>>>>>>> > > > Thanks,
>>>>>>>> > > >
>>>>>>>> > > > Jiangjie (Becket) Qin
>>>>>>>> > > >
>>>>>>>> > > > On Thu, Jun 13, 2019 at 3:51 PM Piotr Nowojski <
>>>>>>>> pi...@ververica.com>
>>>>>>>> > > wrote:
>>>>>>>> > > >
>>>>>>>> > > >> Thanks for driving this. I’ve just noticed one small thing.
>>>>>>>> With the new
>>>>>>>> > > >> SourceReader interface, Flink will be able to provide the
>>>>>>>> `idleTime` metric
>>>>>>>> > > >> automatically.
>>>>>>>> > > >>
>>>>>>>> > > >> Piotrek
>>>>>>>> > > >>
>>>>>>>> > > >>> On 13 Jun 2019, at 03:30, Becket Qin <becket....@gmail.com>
>>>>>>>> wrote:
>>>>>>>> > > >>>
>>>>>>>> > > >>> Thanks all for the feedback and discussion.
>>>>>>>> > > >>>
>>>>>>>> > > >>> Since there wasn't any concern raised, I've started the
>>>>>>>> voting thread
>>>>>>>> > > for
>>>>>>>> > > >>> this FLIP, but please feel free to continue the discussion
>>>>>>>> here if
>>>>>>>> > you
>>>>>>>> > > >>> think something still needs to be addressed.
>>>>>>>> > > >>>
>>>>>>>> > > >>> Thanks,
>>>>>>>> > > >>>
>>>>>>>> > > >>> Jiangjie (Becket) Qin
>>>>>>>> > > >>>
>>>>>>>> > > >>>
>>>>>>>> > > >>>
>>>>>>>> > > >>> On Mon, Jun 10, 2019 at 9:10 AM Becket Qin <
>>>>>>>> becket....@gmail.com>
>>>>>>>> > > wrote:
>>>>>>>> > > >>>
>>>>>>>> > > >>>> Hi Piotr,
>>>>>>>> > > >>>>
>>>>>>>> > > >>>> Thanks for the comments. Yes, you are right. Users will
>>>>>>>> have to look
>>>>>>>> > > at
>>>>>>>> > > >>>> other metrics to decide whether the pipeline is healthy or
>>>>>>>> not in
>>>>>>>> > the
>>>>>>>> > > >> first
>>>>>>>> > > >>>> place before they can use the time-based metric to fix the
>>>>>>>> > bottleneck.
>>>>>>>> > > >>>>
>>>>>>>> > > >>>> I agree that once we have FLIP-27 ready, some of the
>>>>>>>> metrics can
>>>>>>>> > just
>>>>>>>> > > be
>>>>>>>> > > >>>> reported by the abstract implementation.
>>>>>>>> > > >>>>
>>>>>>>> > > >>>> I've updated FLIP-33 wiki page to add the pendingBytes and
>>>>>>>> > > >> pendingRecords
>>>>>>>> > > >>>> metric. Please let me know if you have any concern over
>>>>>>>> the updated
>>>>>>>> > > >> metric
>>>>>>>> > > >>>> convention proposal.
>>>>>>>> > > >>>>
>>>>>>>> > > >>>> @Chesnay Schepler <ches...@apache.org> @Stephan Ewen
>>>>>>>> > > >>>> <step...@ververica.com> will you also have time to take a
>>>>>>>> look at
>>>>>>>> > the
>>>>>>>> > > >>>> proposed metric convention? If there is no further concern
>>>>>>>> I'll
>>>>>>>> > start
>>>>>>>> > > a
>>>>>>>> > > >>>> voting thread for this FLIP in two days.
>>>>>>>> > > >>>>
>>>>>>>> > > >>>> Thanks,
>>>>>>>> > > >>>>
>>>>>>>> > > >>>> Jiangjie (Becket) Qin
>>>>>>>> > > >>>>
>>>>>>>> > > >>>>
>>>>>>>> > > >>>>
>>>>>>>> > > >>>> On Wed, Jun 5, 2019 at 6:54 PM Piotr Nowojski <
>>>>>>>> pi...@ververica.com>
>>>>>>>> > > >> wrote:
>>>>>>>> > > >>>>
>>>>>>>> > > >>>>> Hi Becket,
>>>>>>>> > > >>>>>
>>>>>>>> > > >>>>> Thanks for the answer :)
>>>>>>>> > > >>>>>
>>>>>>>> > > >>>>>> By time-based metric, I meant the portion of time spent
>>>>>>>> on
>>>>>>>> > producing
>>>>>>>> > > >> the
>>>>>>>> > > >>>>>> record to downstream. For example, a source connector
>>>>>>>> can report
>>>>>>>> > > that
>>>>>>>> > > >>>>> it's
>>>>>>>> > > >>>>>> spending 80% of time to emit record to downstream
>>>>>>>> processing
>>>>>>>> > > pipeline.
>>>>>>>> > > >>>>> In
>>>>>>>> > > >>>>>> another case, a sink connector may report that it's
>>>>>>>> spending 30% of
>>>>>>>> > > >> time
>>>>>>>> > > >>>>>> producing the records to the external system.
>>>>>>>> > > >>>>>>
>>>>>>>> > > >>>>>> This is in some sense equivalent to the buffer usage
>>>>>>>> metric:
>>>>>>>> > > >>>>>
>>>>>>>> > > >>>>>> - 80% of time spent on emitting records to downstream
>>>>>>>> --->
>>>>>>>> > > downstream
>>>>>>>> > > >>>>>> node is bottleneck ---> output buffer is probably full.
>>>>>>>> > > >>>>>> - 30% of time spent on emitting records to downstream
>>>>>>>> --->
>>>>>>>> > > downstream
>>>>>>>> > > >>>>>> node is not bottleneck ---> output buffer is probably
>>>>>>>> not full.
>>>>>>>> > > >>>>>
>>>>>>>> > > >>>>> If by “time spent on emitting records to downstream” you
>>>>>>>> understand
>>>>>>>> > > >>>>> “waiting on back pressure”, then I see your point. And I
>>>>>>>> agree that
>>>>>>>> > > >> some
>>>>>>>> > > >>>>> kind of ratio/time based metric gives you more
>>>>>>>> information. However
>>>>>>>> > > >> under
>>>>>>>> > > >>>>> “time spent on emitting records to downstream” might be
>>>>>>>> hidden the
>>>>>>>> > > >>>>> following (extreme) situation:
>>>>>>>> > > >>>>>
>>>>>>>> > > >>>>> 1. The job is barely able to handle the influx of records, there
>>>>>>>> is 99%
>>>>>>>> > > >>>>> CPU/resource usage in the cluster, but nobody is
>>>>>>>> > > >>>>> bottlenecked/backpressured, all output buffers are empty,
>>>>>>>> everybody
>>>>>>>> > > is
>>>>>>>> > > >>>>> waiting in 1% of its time for more records to process.
>>>>>>>> > > >>>>> 2. 80% of the time can still be spent on "down stream
>>>>>>>> operators”, because
>>>>>>>> > > they
>>>>>>>> > > >>>>> are the CPU intensive operations, but this doesn’t mean
>>>>>>>> that
>>>>>>>> > > >> increasing the
>>>>>>>> > > >>>>> parallelism down the stream will help with anything
>>>>>>>> there. To the
>>>>>>>> > > >> contrary,
>>>>>>>> > > >>>>> increasing parallelism of the source operator might help
>>>>>>>> to
>>>>>>>> > increase
>>>>>>>> > > >>>>> resource utilisation up to 100%.
>>>>>>>> > > >>>>>
>>>>>>>> > > >>>>> However, this “time based/ratio” approach can be extended
>>>>>>>> to
>>>>>>>> > > in/output
>>>>>>>> > > >>>>> buffer usage. Besides collecting the information that an
>>>>>>>> input/output
>>>>>>>> > > >> buffer is
>>>>>>>> > > >>>>> full/empty, we can profile how often buffers are
>>>>>>>> empty/full.
>>>>>>>> > If
>>>>>>>> > > >> output
>>>>>>>> > > >>>>> buffer is full 1% of the time, there is almost no back
>>>>>>>> pressure. If
>>>>>>>> > it’s
>>>>>>>> > > >> full
>>>>>>>> > > >>>>> 80% of the time, there is some back pressure; if it’s full
>>>>>>>> 99.9% of
>>>>>>>> > > the time,
>>>>>>>> > > >>>>> there is huge back pressure.
>>>>>>>> > > >>>>>
>>>>>>>> > > >>>>> Now for autoscaling you could compare the input & output
>>>>>>>> buffers
>>>>>>>> > fill
>>>>>>>> > > >>>>> ratio:
>>>>>>>> > > >>>>>
>>>>>>>> > > >>>>> 1. Both are high, the source of bottleneck is down the
>>>>>>>> stream
>>>>>>>> > > >>>>> 2. Output is low, input is high, this is the bottleneck
>>>>>>>> and the
>>>>>>>> > > higher
>>>>>>>> > > >>>>> the difference, the more this operator/task is the source
>>>>>>>> > > >> of the bottleneck
>>>>>>>> > > >>>>> 3. Output is high, input is low - there was some load
>>>>>>>> spike that we
>>>>>>>> > > are
>>>>>>>> > > >>>>> currently finishing processing
>>>>>>>> > > >>>>>
>>>>>>>> > > >>>>>
>>>>>>>> > > >>>>>
>>>>>>>> > > >>>>> But long story short, we are probably diverging from the
>>>>>>>> topic of
>>>>>>>> > > this
>>>>>>>> > > >>>>> discussion, and we can discuss this at some later point.
>>>>>>>> > > >>>>>
>>>>>>>> > > >>>>> For now, for sources:
>>>>>>>> > > >>>>>
>>>>>>>> > > >>>>> as I wrote before, +1 for:
>>>>>>>> > > >>>>> - pending.bytes, Gauge
>>>>>>>> > > >>>>> - pending.messages, Gauge
>>>>>>>> > > >>>>>
>>>>>>>> > > >>>>> When we will be developing/discussing SourceReader from
>>>>>>>> FLIP-27 we
>>>>>>>> > > >> might
>>>>>>>> > > >>>>> then add:
>>>>>>>> > > >>>>>
>>>>>>>> > > >>>>> - in-memory.buffer.usage (0 - 100%)
>>>>>>>> > > >>>>>
>>>>>>>> > > >>>>> Which will be estimated automatically by Flink while user
>>>>>>>> will be
>>>>>>>> > > able
>>>>>>>> > > >> to
>>>>>>>> > > >>>>> override/provide better estimation.
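>>>>>>>> > > >>>>>
>>>>>>>> > > >>>>> As a sketch, a Kafka-like source could back the pending
>>>>>>>> > > >>>>> gauge roughly like this (the offset variables are
>>>>>>>> > > >>>>> assumptions about the client, not a real API):
>>>>>>>> > > >>>>>
>>>>>>>> > > >>>>>   metricGroup.gauge("pending.messages",
>>>>>>>> > > >>>>>           (Gauge<Long>) () -> logEndOffset - currentOffset);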
>>>>>>>> > > >>>>>
>>>>>>>> > > >>>>> Piotrek
>>>>>>>> > > >>>>>
>>>>>>>> > > >>>>>> On 5 Jun 2019, at 05:42, Becket Qin <
>>>>>>>> becket....@gmail.com> wrote:
>>>>>>>> > > >>>>>>
>>>>>>>> > > >>>>>> Hi Piotr,
>>>>>>>> > > >>>>>>
>>>>>>>> > > >>>>>> Thanks for the explanation. Please see some
>>>>>>>> clarifications below.
>>>>>>>> > > >>>>>>
>>>>>>>> > > >>>>>> By time-based metric, I meant the portion of time spent
>>>>>>>> on
>>>>>>>> > producing
>>>>>>>> > > >> the
>>>>>>>> > > >>>>>> record to downstream. For example, a source connector
>>>>>>>> can report
>>>>>>>> > > that
>>>>>>>> > > >>>>> it's
>>>>>>>> > > >>>>>> spending 80% of time to emit record to downstream
>>>>>>>> processing
>>>>>>>> > > pipeline.
>>>>>>>> > > >>>>> In
>>>>>>>> > > >>>>>> another case, a sink connector may report that it's
>>>>>>>> spending 30% of
>>>>>>>> > > >> time
>>>>>>>> > > >>>>>> producing the records to the external system.
>>>>>>>> > > >>>>>>
>>>>>>>> > > >>>>>> This is in some sense equivalent to the buffer usage
>>>>>>>> metric:
>>>>>>>> > > >>>>>> - 80% of time spent on emitting records to downstream
>>>>>>>> --->
>>>>>>>> > > downstream
>>>>>>>> > > >>>>>> node is bottleneck ---> output buffer is probably full.
>>>>>>>> > > >>>>>> - 30% of time spent on emitting records to downstream
>>>>>>>> --->
>>>>>>>> > > downstream
>>>>>>>> > > >>>>>> node is not bottleneck ---> output buffer is probably
>>>>>>>> not full.
>>>>>>>> > > >>>>>>
>>>>>>>> > > >>>>>> However, the time-based metric has a few advantages that
>>>>>>>> the
>>>>>>>> > buffer
>>>>>>>> > > >>>>> usage
>>>>>>>> > > >>>>>> metric may not have.
>>>>>>>> > > >>>>>>
>>>>>>>> > > >>>>>> 1.  Buffer usage metric may not be applicable to all the
>>>>>>>> connector
>>>>>>>> > > >>>>>> implementations, while reporting time-based metric are
>>>>>>>> always
>>>>>>>> > > doable.
>>>>>>>> > > >>>>>> Some source connectors may not have any input buffer, or
>>>>>>>> they may
>>>>>>>> > > use
>>>>>>>> > > >>>>> some
>>>>>>>> > > >>>>>> third party library that does not expose the input
>>>>>>>> buffer at all.
>>>>>>>> > > >>>>>> Similarly, for sink connectors, the implementation may
>>>>>>>> not have
>>>>>>>> > any
>>>>>>>> > > >>>>> output
>>>>>>>> > > >>>>>> buffer, or the third party library does not expose such
>>>>>>>> buffer.
>>>>>>>> > > >>>>>>
>>>>>>>> > > >>>>>> 2. Although both type of metrics can detect bottleneck,
>>>>>>>> time-based
>>>>>>>> > > >>>>> metrics
>>>>>>>> > > >>>>>> can be used to generate a more informed action to remove
>>>>>>>> the
>>>>>>>> > > >> bottleneck.
>>>>>>>> > > >>>>>> For example, when the downstream is bottleneck, the
>>>>>>>> output buffer
>>>>>>>> > > >> usage
>>>>>>>> > > >>>>>> metric is likely to be 100%, and the input buffer usage
>>>>>>>> might be
>>>>>>>> > 0%.
>>>>>>>> > > >>>>> That
>>>>>>>> > > >>>>>> means we don't know what the suitable parallelism is to
>>>>>>>> lift the
>>>>>>>> > > >>>>>> bottleneck. The time-based metric, on the other hand,
>>>>>>>> would give
>>>>>>>> > > >> useful
>>>>>>>> > > >>>>>> information, e.g. if 80% of time was spent on emitting
>>>>>>>> records, we
>>>>>>>> > > can
>>>>>>>> > > >>>>>> roughly increase the downstream node parallelism by 4
>>>>>>>> times (i.e. the 80% / 20% ratio).
>>>>>>>> > > >>>>>>
>>>>>>>> > > >>>>>> Admittedly, the time-based metrics are more expensive
>>>>>>>> than buffer
>>>>>>>> > > >>>>> usage. So
>>>>>>>> > > >>>>>> we may have to do some sampling to reduce the cost. But
>>>>>>>> in
>>>>>>>> > general,
>>>>>>>> > > >>>>> using
>>>>>>>> > > >>>>>> time-based metrics seems worth adding.
>>>>>>>> > > >>>>>>
>>>>>>>> > > >>>>>> That being said, I don't think buffer usage metric and
>>>>>>>> time-based
>>>>>>>> > > >>>>> metrics
>>>>>>>> > > >>>>>> are mutually exclusive. We can probably have both. It is
>>>>>>>> just that
>>>>>>>> > > in
>>>>>>>> > > >>>>>> practice, features like auto-scaling might prefer
>>>>>>>> time-based
>>>>>>>> > metrics
>>>>>>>> > > >> for
>>>>>>>> > > >>>>>> the reason stated above.
>>>>>>>> > > >>>>>>
>>>>>>>> > > >>>>>>> 1. Define the metrics that would allow us to manually
>>>>>>>> detect
>>>>>>>> > > >>>>> bottlenecks.
>>>>>>>> > > >>>>>> As I wrote, we already have them in most of the places,
>>>>>>>> except of
>>>>>>>> > > >>>>>> sources/sinks.
>>>>>>>> > > >>>>>>> 2. Use those metrics, to automatically detect
>>>>>>>> bottlenecks.
>>>>>>>> > > Currently
>>>>>>>> > > >> we
>>>>>>>> > > >>>>>> are only automatically detecting back pressure and
>>>>>>>> reporting it to
>>>>>>>> > > the
>>>>>>>> > > >>>>> user
>>>>>>>> > > >>>>>> in web UI (is it exposed as a metric at all?). Detecting
>>>>>>>> the root
>>>>>>>> > > >> cause
>>>>>>>> > > >>>>> of
>>>>>>>> > > >>>>>> the back pressure (bottleneck) is one step further.
>>>>>>>> > > >>>>>>> 3. Use the knowledge about where exactly the bottleneck
>>>>>>>> is
>>>>>>>> > located,
>>>>>>>> > > >> to
>>>>>>>> > > >>>>>> try to do something with it.
>>>>>>>> > > >>>>>>
>>>>>>>> > > >>>>>> As explained above, I think time-based metric also
>>>>>>>> addresses item
>>>>>>>> > 1
>>>>>>>> > > >> and
>>>>>>>> > > >>>>>> item 2.
>>>>>>>> > > >>>>>>
>>>>>>>> > > >>>>>> Any thoughts?
>>>>>>>> > > >>>>>>
>>>>>>>> > > >>>>>> Thanks,
>>>>>>>> > > >>>>>>
>>>>>>>> > > >>>>>> Jiangjie (Becket) Qin
>>>>>>>> > > >>>>>>
>>>>>>>> > > >>>>>>
>>>>>>>> > > >>>>>>
>>>>>>>> > > >>>>>> On Mon, Jun 3, 2019 at 4:14 PM Piotr Nowojski <
>>>>>>>> > pi...@ververica.com>
>>>>>>>> > > >>>>> wrote:
>>>>>>>> > > >>>>>>
>>>>>>>> > > >>>>>>> Hi again :)
>>>>>>>> > > >>>>>>>
>>>>>>>> > > >>>>>>>> - pending.bytes, Gauge
>>>>>>>> > > >>>>>>>> - pending.messages, Gauge
>>>>>>>> > > >>>>>>>
>>>>>>>> > > >>>>>>>
>>>>>>>> > > >>>>>>> +1
>>>>>>>> > > >>>>>>>
>>>>>>>> > > >>>>>>> And true, instead of overloading one of the metric it
>>>>>>>> is better
>>>>>>>> > > when
>>>>>>>> > > >>>>> user
>>>>>>>> > > >>>>>>> can choose to provide only one of them.
>>>>>>>> > > >>>>>>>
>>>>>>>> > > >>>>>>> Re 2:
>>>>>>>> > > >>>>>>>
>>>>>>>> > > >>>>>>>> If I understand correctly, this metric along with the
>>>>>>>> pending
>>>>>>>> > > >> mesages
>>>>>>>> > > >>>>> /
>>>>>>>> > > >>>>>>>> bytes would answer the questions of:
>>>>>>>> > > >>>>>>>
>>>>>>>> > > >>>>>>>> - Does the connector consume fast enough? Lagging
>>>>>>>> behind + empty
>>>>>>>> > > >>>>> buffer
>>>>>>>> > > >>>>>>> =
>>>>>>>> > > >>>>>>>> cannot consume fast enough.
>>>>>>>> > > >>>>>>>> - Does the connector emit fast enough? Lagging behind
>>>>>>>> + full
>>>>>>>> > > buffer
>>>>>>>> > > >> =
>>>>>>>> > > >>>>>>>> cannot emit fast enough, i.e. the Flink pipeline is
>>>>>>>> slow.
>>>>>>>> > > >>>>>>>
>>>>>>>> > > >>>>>>> Yes, exactly. This can also be used to support
>>>>>>>> decisions like
>>>>>>>> > > >> changing
>>>>>>>> > > >>>>> the
>>>>>>>> > > >>>>> parallelism of the sources and/or downstream operators.
>>>>>>>> > > >>>>>>>
>>>>>>>> > > >>>>>>> I’m not sure if I understand your proposal with time
>>>>>>>> based
>>>>>>>> > > >>>>> measurements.
>>>>>>>> > > >>>>>>> Maybe I’m missing something, but I do not see how
>>>>>>>> measuring time
>>>>>>>> > > >> alone
>>>>>>>> > > >>>>>>> could answer the problem: where is the bottleneck. Time
>>>>>>>> spent on
>>>>>>>> > > the
>>>>>>>> > > >>>>>>> next/emit might be short or long (depending on how
>>>>>>>> heavy the
>>>>>>>> > record
>>>>>>>> > > >> is to
>>>>>>>> > > >>>>>>> process) and the source can still be bottlenecked/back
>>>>>>>> > pressured
>>>>>>>> > > or
>>>>>>>> > > >>>>> not.
>>>>>>>> > > >>>>>>> Usually the easiest and the most reliable way how to
>>>>>>>> detect
>>>>>>>> > > >>>>> bottlenecks is
>>>>>>>> > > >>>>>>> by checking usage of input & output buffers, since when
>>>>>>>> input
>>>>>>>> > > buffer
>>>>>>>> > > >> is
>>>>>>>> > > >>>>>>> full while output buffer is empty, that’s the
>>>>>>>> definition of a
>>>>>>>> > > >>>>> bottleneck.
>>>>>>>> > > >>>>>>> Also this is usually very easy and cheap to measure (it
>>>>>>>> works
>>>>>>>> > > >>>>> effectively
>>>>>>>> > > >>>>>>> the same way as Flink’s current back pressure
>>>>>>>> monitoring, but
>>>>>>>> > more
>>>>>>>> > > >>>>> cleanly,
>>>>>>>> > > >>>>>>> without probing thread’s stack traces).
>>>>>>>> > > >>>>>>>
>>>>>>>> > > >>>>>>> Also keep in mind that we are already using the buffer
>>>>>>>> usage
>>>>>>>> > > metrics
>>>>>>>> > > >>>>> for
>>>>>>>> > > >>>>>>> detecting the bottlenecks in Flink’s internal network
>>>>>>>> exchanges
>>>>>>>> > > >> (manual
>>>>>>>> > > >>>>>>> work). That’s the reason why I wanted to extend this to
>>>>>>>> > > >> sources/sinks,
>>>>>>>> > > >>>>>>> since they are currently our blind spot.
>>>>>>>> > > >>>>>>>
>>>>>>>> > > >>>>>>>> One feature we are currently working on to scale Flink
>>>>>>>> > > automatically
>>>>>>>> > > >>>>>>> relies
>>>>>>>> > > >>>>>>>> on some metrics answering the same question
>>>>>>>> > > >>>>>>>
>>>>>>>> > > >>>>>>> That would be very helpful feature. I think in order to
>>>>>>>> achieve
>>>>>>>> > > that
>>>>>>>> > > >> we
>>>>>>>> > > >>>>>>> would need to:
>>>>>>>> > > >>>>>>> 1. Define the metrics that would allow us to manually
>>>>>>>> detect
>>>>>>>> > > >>>>> bottlenecks.
>>>>>>>> > > >>>>>>> As I wrote, we already have them in most of the places,
>>>>>>>> except of
>>>>>>>> > > >>>>>>> sources/sinks.
>>>>>>>> > > >>>>>>> 2. Use those metrics, to automatically detect
>>>>>>>> bottlenecks.
>>>>>>>> > > Currently
>>>>>>>> > > >> we
>>>>>>>> > > >>>>>>> are only automatically detecting back pressure and
>>>>>>>> reporting it
>>>>>>>> > to
>>>>>>>> > > >> the
>>>>>>>> > > >>>>> user
>>>>>>>> > > >>>>>>> in web UI (is it exposed as a metric at all?).
>>>>>>>> Detecting the root
>>>>>>>> > > >>>>> cause of
>>>>>>>> > > >>>>>>> the back pressure (bottleneck) is one step further.
>>>>>>>> > > >>>>>>> 3. Use the knowledge about where exactly the bottleneck
>>>>>>>> is
>>>>>>>> > located,
>>>>>>>> > > >> to
>>>>>>>> > > >>>>> try
>>>>>>>> > > >>>>>>> to do something with it.
>>>>>>>> > > >>>>>>>
>>>>>>>> > > >>>>>>> I think you are aiming for point 3., but before we
>>>>>>>> reach it, we
>>>>>>>> > are
>>>>>>>> > > >>>>> still
>>>>>>>> > > >>>>>>> missing 1. & 2. Also even if we have 3., there is a
>>>>>>>> value in 1 &
>>>>>>>> > 2
>>>>>>>> > > >> for
>>>>>>>> > > >>>>>>> manual analysis/dashboards.
>>>>>>>> > > >>>>>>>
>>>>>>>> > > >>>>>>> However, having the knowledge of where the bottleneck
>>>>>>>> is, doesn’t
>>>>>>>> > > >>>>>>> necessarily mean that we know what we can do about it.
>>>>>>>> For
>>>>>>>> > example
>>>>>>>> > > >>>>>>> increasing parallelism might or might not help with
>>>>>>>> anything
>>>>>>>> > (data
>>>>>>>> > > >>>>> skew,
>>>>>>>> > > >>>>>>> bottleneck on some resource that does not scale), but
>>>>>>>> this remark
>>>>>>>> > > >>>>> applies
>>>>>>>> > > >>>>>>> always, regardless of the way how did we detect the
>>>>>>>> bottleneck.
>>>>>>>> > > >>>>>>>
>>>>>>>> > > >>>>>>> Piotrek
>>>>>>>> > > >>>>>>>
>>>>>>>> > > >>>>>>>> On 3 Jun 2019, at 06:16, Becket Qin <
>>>>>>>> becket....@gmail.com>
>>>>>>>> > wrote:
>>>>>>>> > > >>>>>>>>
>>>>>>>> > > >>>>>>>> Hi Piotr,
>>>>>>>> > > >>>>>>>>
>>>>>>>> > > >>>>>>>> Thanks for the suggestion. Some thoughts below:
>>>>>>>> > > >>>>>>>>
>>>>>>>> > > >>>>>>>> Re 1: The pending messages / bytes.
>>>>>>>> > > >>>>>>>> I completely agree these are very useful metrics and
>>>>>>>> we should
>>>>>>>> > > >> expect
>>>>>>>> > > >>>>> the
>>>>>>>> > > >>>>>>>> connector to report. WRT the way to expose them, it
>>>>>>>> seems more
>>>>>>>> > > >>>>> consistent
>>>>>>>> > > >>>>>>>> to add two metrics instead of adding a method (unless
>>>>>>>> there are
>>>>>>>> > > >> other
>>>>>>>> > > >>>>> use
>>>>>>>> > > >>>>>>>> cases other than metric reporting). So we can add the
>>>>>>>> following
>>>>>>>> > > two
>>>>>>>> > > >>>>>>> metrics.
>>>>>>>> > > >>>>>>>> - pending.bytes, Gauge
>>>>>>>> > > >>>>>>>> - pending.messages, Gauge
>>>>>>>> > > >>>>>>>> Applicable connectors can choose to report them. These
>>>>>>>> two
>>>>>>>> > metrics
>>>>>>>> > > >>>>> along
>>>>>>>> > > >>>>>>>> with latency should be sufficient for users to
>>>>>>>> understand the
>>>>>>>> > > >> progress
>>>>>>>> > > >>>>>>> of a
>>>>>>>> > > >>>>>>>> connector.
>>>>>>>> > > >>>>>>>>
>>>>>>>> > > >>>>>>>>
>>>>>>>> > > >>>>>>>> Re 2: Number of buffered data in-memory of the
>>>>>>>> connector
>>>>>>>> > > >>>>>>>> If I understand correctly, this metric along with the
>>>>>>>> pending
>>>>>>>> > > >> mesages
>>>>>>>> > > >>>>> /
>>>>>>>> > > >>>>>>>> bytes would answer the questions of:
>>>>>>>> > > >>>>>>>> - Does the connector consume fast enough? Lagging
>>>>>>>> behind + empty
>>>>>>>> > > >>>>> buffer
>>>>>>>> > > >>>>>>> =
>>>>>>>> > > >>>>>>>> cannot consume fast enough.
>>>>>>>> > > >>>>>>>> - Does the connector emit fast enough? Lagging behind
>>>>>>>> + full
>>>>>>>> > > buffer
>>>>>>>> > > >> =
>>>>>>>> > > >>>>>>>> cannot emit fast enough, i.e. the Flink pipeline is
>>>>>>>> slow.
>>>>>>>> > > >>>>>>>>
>>>>>>>> > > >>>>>>>> One feature we are currently working on to scale Flink
>>>>>>>> > > automatically
>>>>>>>> > > >>>>>>> relies
>>>>>>>> > > >>>>>>>> on some metrics answering the same question, more
>>>>>>>> specifically,
>>>>>>>> > we
>>>>>>>> > > >> are
>>>>>>>> > > >>>>>>>> profiling the time spent on .next() method (time to
>>>>>>>> consume) and
>>>>>>>> > > the
>>>>>>>> > > >>>>> time
>>>>>>>> > > >>>>>>>> spent on .collect() method (time to emit / process).
>>>>>>>> One
>>>>>>>> > advantage
>>>>>>>> > > >> of
>>>>>>>> > > >>>>>>> such
>>>>>>>> > > >>>>>>>> method level time cost allows us to calculate the
>>>>>>>> parallelism
>>>>>>>> > > >>>>> required to
>>>>>>>> > > >>>>>>>> keep up in case their is a lag.
>>>>>>>> > > >>>>>>>>
>>>>>>>> > > >>>>>>>> However, one concern I have regarding such metric is
>>>>>>>> that they
>>>>>>>> > are
>>>>>>>> > > >>>>>>>> implementation specific. Either profiling on the
>>>>>>>> method time, or
>>>>>>>> > > >>>>>>> reporting
>>>>>>>> > > >>>>>>>> buffer usage assumes the connector are implemented in
>>>>>>>> such a
>>>>>>>> > way.
>>>>>>>> > > A
>>>>>>>> > > >>>>>>>> slightly better solution might be have the following
>>>>>>>> metric:
>>>>>>>> > > >>>>>>>>
>>>>>>>> > > >>>>>>>>  - EmitTimeRatio (or FetchTimeRatio): The time spent
>>>>>>>> on emitting
>>>>>>>> > > >>>>>>>> records / Total time elapsed.
>>>>>>>> > > >>>>>>>>
>>>>>>>> > > >>>>>>>> This assumes that the source connectors have to emit
>>>>>>>> the records
>>>>>>>> > > to
>>>>>>>> > > >>>>> the
>>>>>>>> > > >>>>>>>> downstream at some point. The emission may take some
>>>>>>>> time (e.g. it may
>>>>>>>> > > go
>>>>>>>> > > >>>>>>> through
>>>>>>>> > > >>>>>>>> chained operators). And the rest of the time is spent
>>>>>>>> > preparing
>>>>>>>> > > >> the
>>>>>>>> > > >>>>>>>> record to emit, including time for consuming and format
>>>>>>>> > > conversion,
>>>>>>>> > > >>>>> etc.
>>>>>>>> > > >>>>>>>> Ideally, we'd like to see the time spent on record
>>>>>>>> fetch and
>>>>>>>> > emit
>>>>>>>> > > to
>>>>>>>> > > >>>>> be
>>>>>>>> > > >>>>>>>> about the same, so neither is a bottleneck for the other.
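>>>>>>>> > > >>>>>>>>
>>>>>>>> > > >>>>>>>> A rough sketch of a sampled EmitTimeRatio (all names are
>>>>>>>> > > >>>>>>>> mine; sampling keeps System.nanoTime() off the common
>>>>>>>> > > >>>>>>>> path):
>>>>>>>> > > >>>>>>>>
>>>>>>>> > > >>>>>>>>   if (recordCount++ % SAMPLE_EVERY == 0) {
>>>>>>>> > > >>>>>>>>       long t0 = System.nanoTime();
>>>>>>>> > > >>>>>>>>       collector.collect(record);
>>>>>>>> > > >>>>>>>>       emitNanos += System.nanoTime() - t0;
>>>>>>>> > > >>>>>>>>   } else {
>>>>>>>> > > >>>>>>>>       collector.collect(record);
>>>>>>>> > > >>>>>>>>   }
>>>>>>>> > > >>>>>>>>   // EmitTimeRatio ~= (emitNanos * SAMPLE_EVERY) / totalNanosElapsed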
>>>>>>>> > > >>>>>>>>
>>>>>>>> > > >>>>>>>> The downside of these time based metrics is additional
>>>>>>>> overhead
>>>>>>>> > to
>>>>>>>> > > >> get
>>>>>>>> > > >>>>>>> the
>>>>>>>> > > >>>>>>>> time, therefore sampling might be needed. But in
>>>>>>>> practice I feel
>>>>>>>> > > >> such
>>>>>>>> > > >>>>>>> time
>>>>>>>> > > >>>>>>>> based metric might be more useful if we want to take
>>>>>>>> action.
>>>>>>>> > > >>>>>>>>
>>>>>>>> > > >>>>>>>>
>>>>>>>> > > >>>>>>>> I think we should absolutely add metrics in (1) to the
>>>>>>>> metric
>>>>>>>> > > >>>>> convention.
>>>>>>>> > > >>>>>>>> We could also add the metrics mentioned in (2) if we
>>>>>>>> reach
>>>>>>>> > > consensus
>>>>>>>> > > >>>>> on
>>>>>>>> > > >>>>>>>> that. What do you think?
>>>>>>>> > > >>>>>>>>
>>>>>>>> > > >>>>>>>> Thanks,
>>>>>>>> > > >>>>>>>>
>>>>>>>> > > >>>>>>>> Jiangjie (Becket) Qin
>>>>>>>> > > >>>>>>>>
>>>>>>>> > > >>>>>>>>
>>>>>>>> > > >>>>>>>> On Fri, May 31, 2019 at 4:26 PM Piotr Nowojski <
>>>>>>>> > > pi...@ververica.com
>>>>>>>> > > >>>
>>>>>>>> > > >>>>>>> wrote:
>>>>>>>> > > >>>>>>>>
>>>>>>>> > > >>>>>>>>> Hey Becket,
>>>>>>>> > > >>>>>>>>>
>>>>>>>> > > >>>>>>>>> Re 1a) and 1b) +1 from my side.
>>>>>>>> > > >>>>>>>>>
>>>>>>>> > > >>>>>>>>> I’ve discussed this issue:
>>>>>>>> > > >>>>>>>>>
>>>>>>>> > > >>>>>>>>>>>>
>>>>>>>> > > >>>>>>>>>>>> 2. It would be nice to have metrics, that allow us
>>>>>>>> to check
>>>>>>>> > > the
>>>>>>>> > > >>>>> cause
>>>>>>>> > > >>>>>>>>> of
>>>>>>>> > > >>>>>>>>>>>> back pressure:
>>>>>>>> > > >>>>>>>>>>>> a) for sources, length of input queue (in bytes?
>>>>>>>> Or boolean
>>>>>>>> > > >>>>>>>>>>>> hasSomethingl/isEmpty)
>>>>>>>> > > >>>>>>>>>>>> b) for sinks, length of output queue (in bytes? Or
>>>>>>>> boolean
>>>>>>>> > > >>>>>>>>>>>> hasSomething/isEmpty)
>>>>>>>> > > >>>>>>>>>
>>>>>>>> > > >>>>>>>>> With Nico at some lengths and he also saw the
>>>>>>>> benefits of them.
>>>>>>>> > > We
>>>>>>>> > > >>>>> also
>>>>>>>> > > >>>>>>>>> have more concrete proposal for that.
>>>>>>>> > > >>>>>>>>>
>>>>>>>> > > >>>>>>>>> Actually there are two really useful metrics, that we
>>>>>>>> are
>>>>>>>> > missing
>>>>>>>> > > >>>>>>>>> currently:
>>>>>>>> > > >>>>>>>>>
>>>>>>>> > > >>>>>>>>> 1. Number of data/records/bytes in the backlog to
>>>>>>>> process. For
>>>>>>>> > > >>>>> example
>>>>>>>> > > >>>>>>>>> remaining number of bytes in unread files. Or pending
>>>>>>>> data in
>>>>>>>> > > Kafka
>>>>>>>> > > >>>>>>> topics.
>>>>>>>> > > >>>>>>>>> 2. Number of records buffered in-memory by the
>>>>>>>> connector, that are
>>>>>>>> > > >>>>> waiting
>>>>>>>> > > >>>>>>> to
>>>>>>>> > > >>>>>>>>> be pushed to the Flink pipeline.
>>>>>>>> > > >>>>>>>>>
>>>>>>>> > > >>>>>>>>> Re 1:
>>>>>>>> > > >>>>>>>>> This would have to be a metric provided directly by a
>>>>>>>> > connector.
>>>>>>>> > > It
>>>>>>>> > > >>>>>>> could
>>>>>>>> > > >>>>>>>>> be an undefined `int`:
>>>>>>>> > > >>>>>>>>>
>>>>>>>> > > >>>>>>>>> `int backlog` - estimate of pending work.
>>>>>>>> > > >>>>>>>>>
>>>>>>>> > > >>>>>>>>> “Undefined” meaning that it would be up to a connector
>>>>>>>> > > >>>>>>>>> to decide whether it’s measured in bytes, records,
>>>>>>>> > > >>>>>>>>> pending files, or whatever the connector is able to
>>>>>>>> > > >>>>>>>>> provide. This is because I assume not every connector
>>>>>>>> > > >>>>>>>>> can provide an exact number, and for some of them it
>>>>>>>> > > >>>>>>>>> might be impossible to provide a record or byte count.
>>>>>>>> > > >>>>>>>>>
>>>>>>>> > > >>>>>>>>> Re 2:
>>>>>>>> > > >>>>>>>>> This metric could be either provided by a connector,
>>>>>>>> or
>>>>>>>> > > calculated
>>>>>>>> > > >>>>>>> crudely
>>>>>>>> > > >>>>>>>>> by Flink:
>>>>>>>> > > >>>>>>>>>
>>>>>>>> > > >>>>>>>>> `float bufferUsage` - value from [0.0, 1.0] range
>>>>>>>> > > >>>>>>>>>
>>>>>>>> > > >>>>>>>>> Percentage of used in memory buffers, like in Kafka’s
>>>>>>>> handover.
>>>>>>>> > > >>>>>>>>>
>>>>>>>> > > >>>>>>>>> It could be crudely implemented by Flink with FLIP-27
>>>>>>>> > > >>>>>>>>> SourceReader#isAvailable. If SourceReader is not
>>>>>>>> available,
>>>>>>>> > > the reported
>>>>>>>> > > >>>>>>>>> `bufferUsage` could be 0.0. If it is available, it
>>>>>>>> could be
>>>>>>>> > 1.0.
>>>>>>>> > > I
>>>>>>>> > > >>>>> think
>>>>>>>> > > >>>>>>>>> this would be a good enough estimation for most of
>>>>>>>> the use
>>>>>>>> > cases
>>>>>>>> > > >>>>> (that
>>>>>>>> > > >>>>>>>>> could be overloaded and implemented better if
>>>>>>>> desired).
>>>>>>>> > > Especially
>>>>>>>> > > >>>>>>> since we
>>>>>>>> > > >>>>>>>>> are reporting only probed values: if probed values
>>>>>>>> are almost
>>>>>>>> > > >> always
>>>>>>>> > > >>>>>>> “1.0”,
>>>>>>>> > > >>>>>>>>> it would mean that we have back pressure. If they
>>>>>>>> are almost
>>>>>>>> > > >> always
>>>>>>>> > > >>>>>>>>> “0.0”, there is probably no back pressure at the
>>>>>>>> sources.
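>>>>>>>> > > >>>>>>>>>
>>>>>>>> > > >>>>>>>>> A crude sketch of that probing (not actual FLIP-27 code;
>>>>>>>> > > >>>>>>>>> it assumes a periodic probe rather than a per-record
>>>>>>>> > > >>>>>>>>> check):
>>>>>>>> > > >>>>>>>>>
>>>>>>>> > > >>>>>>>>>   CompletableFuture<Void> available = sourceReader.isAvailable();
>>>>>>>> > > >>>>>>>>>   float bufferUsage = available.isDone() ? 1.0f : 0.0f;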
>>>>>>>> > > >>>>>>>>>
>>>>>>>> > > >>>>>>>>> What do you think about this?
>>>>>>>> > > >>>>>>>>>
>>>>>>>> > > >>>>>>>>> Piotrek
>>>>>>>> > > >>>>>>>>>
>>>>>>>> > > >>>>>>>>>> On 30 May 2019, at 11:41, Becket Qin <
>>>>>>>> becket....@gmail.com>
>>>>>>>> > > >> wrote:
>>>>>>>> > > >>>>>>>>>>
>>>>>>>> > > >>>>>>>>>> Hi all,
>>>>>>>> > > >>>>>>>>>>
>>>>>>>> > > >>>>>>>>>> Thanks a lot for all the feedback and comments. I'd
>>>>>>>> like to
>>>>>>>> > > >> continue
>>>>>>>> > > >>>>>>> the
>>>>>>>> > > >>>>>>>>>> discussion on this FLIP.
>>>>>>>> > > >>>>>>>>>>
>>>>>>>> > > >>>>>>>>>> I updated the FLIP-33 wiki to remove all the
>>>>>>>> histogram metrics
>>>>>>>> > > >> from
>>>>>>>> > > >>>>> the
>>>>>>>> > > >>>>>>>>>> first version of metric convention due to the
>>>>>>>> performance
>>>>>>>> > > concern.
>>>>>>>> > > >>>>> The
>>>>>>>> > > >>>>>>>>> plan
>>>>>>>> > > >>>>>>>>>> is to introduce them later when we have a mechanism
>>>>>>>> to opt
>>>>>>>> > > in/out
>>>>>>>> > > >>>>>>>>> metrics.
>>>>>>>> > > >>>>>>>>>> At that point, users can decide whether they want to
>>>>>>>> pay the
>>>>>>>> > > cost
>>>>>>>> > > >> to
>>>>>>>> > > >>>>>>> get
>>>>>>>> > > >>>>>>>>>> the metric or not.
>>>>>>>> > > >>>>>>>>>>
>>>>>>>> > > >>>>>>>>>> As Stephan suggested, for this FLIP, let's first try
>>>>>>>> to agree
>>>>>>>> > on
>>>>>>>> > > >> the
>>>>>>>> > > >>>>>>>>> small
>>>>>>>> > > >>>>>>>>>> list of conventional metrics that connectors should
>>>>>>>> follow.
>>>>>>>> > > >>>>>>>>>> Just to be clear, the purpose of the convention is
>>>>>>>> not to
>>>>>>>> > > enforce
>>>>>>>> > > >>>>> every
>>>>>>>> > > >>>>>>>>>> connector to report all these metrics, but to
>>>>>>>> provide a
>>>>>>>> > guidance
>>>>>>>> > > >> in
>>>>>>>> > > >>>>>>> case
>>>>>>>> > > >>>>>>>>>> these metrics are reported by some connectors.
>>>>>>>> > > >>>>>>>>>>
>>>>>>>> > > >>>>>>>>>>
>>>>>>>> > > >>>>>>>>>> @ Stephan & Chesnay,
>>>>>>>> > > >>>>>>>>>>
>>>>>>>> > > >>>>>>>>>> Regarding the duplication of `RecordsIn` metric in
>>>>>>>> operator /
>>>>>>>> > > task
>>>>>>>> > > >>>>>>>>>> IOMetricGroups, from what I understand, for source
>>>>>>>> operator,
>>>>>>>> > it
>>>>>>>> > > is
>>>>>>>> > > >>>>>>>>> actually
>>>>>>>> > > >>>>>>>>>> the SourceFunction that reports the operator level
>>>>>>>> > > >>>>>>>>>> RecordsIn/RecordsInPerSecond metric. So they are
>>>>>>>> essentially
>>>>>>>> > the
>>>>>>>> > > >>>>> same
>>>>>>>> > > >>>>>>>>>> metric in the operator level IOMetricGroup.
>>>>>>>> Similarly for the
>>>>>>>> > > Sink
>>>>>>>> > > >>>>>>>>>> operator, the operator level
>>>>>>>> RecordsOut/RecordsOutPerSecond
>>>>>>>> > > >> metrics
>>>>>>>> > > >>>>> are
>>>>>>>> > > >>>>>>>>>> also reported by the Sink function. I marked them as
>>>>>>>> existing
>>>>>>>> > in
>>>>>>>> > > >> the
>>>>>>>> > > >>>>>>>>>> FLIP-33 wiki page. Please let me know if I
>>>>>>>> misunderstood.
>>>>>>>> > > >>>>>>>>>>
>>>>>>>> > > >>>>>>>>>> Thanks,
>>>>>>>> > > >>>>>>>>>>
>>>>>>>> > > >>>>>>>>>> Jiangjie (Becket) Qin
>>>>>>>> > > >>>>>>>>>>
>>>>>>>> > > >>>>>>>>>>
>>>>>>>> > > >>>>>>>>>> On Thu, May 30, 2019 at 5:16 PM Becket Qin <
>>>>>>>> > > becket....@gmail.com>
>>>>>>>> > > >>>>>>> wrote:
>>>>>>>> > > >>>>>>>>>>
>>>>>>>> > > >>>>>>>>>>> Hi Piotr,
>>>>>>>> > > >>>>>>>>>>>
>>>>>>>> > > >>>>>>>>>>> Thanks a lot for the feedback.
>>>>>>>> > > >>>>>>>>>>>
>>>>>>>> > > >>>>>>>>>>> 1a) I guess you are referring to the part that
>>>>>>>> "original
>>>>>>>> > system
>>>>>>>> > > >>>>>>> specific
>>>>>>>> > > >>>>>>>>>>> metrics should also be reported". The performance
>>>>>>>> impact
>>>>>>>> > > depends
>>>>>>>> > > >> on
>>>>>>>> > > >>>>>>> the
>>>>>>>> > > >>>>>>>>>>> implementation. An efficient implementation would
>>>>>>>> only record
>>>>>>>> > > the
>>>>>>>> > > >>>>>>> metric
>>>>>>>> > > >>>>>>>>>>> once, but report them with two different metric
>>>>>>>> names. This
>>>>>>>> > is
>>>>>>>> > > >>>>>>> unlikely
>>>>>>>> > > >>>>>>>>> to
>>>>>>>> > > >>>>>>>>>>> hurt performance.
>>>>>>>> > > >>>>>>>>>>>
>>>>>>>> > > >>>>>>>>>>> 1b) Yes, I agree that we should avoid adding
>>>>>>>> overhead to the
>>>>>>>> > > >>>>> critical
>>>>>>>> > > >>>>>>>>> path
>>>>>>>> > > >>>>>>>>>>> by all means. This is sometimes a tradeoff, running
>>>>>>>> blindly
>>>>>>>> > > >> without
>>>>>>>> > > >>>>>>> any
>>>>>>>> > > >>>>>>>>>>> metric gives best performance, but sometimes might
>>>>>>>> be
>>>>>>>> > > frustrating
>>>>>>>> > > >>>>> when
>>>>>>>> > > >>>>>>>>> we
>>>>>>>> > > >>>>>>>>>>> debug some issues.
>>>>>>>> > > >>>>>>>>>>>
>>>>>>>> > > >>>>>>>>>>> 2. The metrics are indeed very useful. Are they
>>>>>>>> supposed to
>>>>>>>> > be
>>>>>>>> > > >>>>>>> reported
>>>>>>>> > > >>>>>>>>> by
>>>>>>>> > > >>>>>>>>>>> the connectors or Flink itself? At this point
>>>>>>>> FLIP-33 is more
>>>>>>>> > > >>>>> focused
>>>>>>>> > > >>>>>>> on
>>>>>>>> > > >>>>>>>>>>> provide a guidance to the connector authors on the
>>>>>>>> metrics
>>>>>>>> > > >>>>> reporting.
>>>>>>>> > > >>>>>>>>> That
>>>>>>>> > > >>>>>>>>>>> said, after FLIP-27, I think we should absolutely
>>>>>>>> report
>>>>>>>> > these
>>>>>>>> > > >>>>> metrics
>>>>>>>> > > >>>>>>>>> in
>>>>>>>> > > >>>>>>>>>>> the abstract implementation. In any case, the metric
>>>>>>>> > convention
>>>>>>>> > > >> in
>>>>>>>> > > >>>>>>> this
>>>>>>>> > > >>>>>>>>>>> list are expected to evolve over time.
>>>>>>>> > > >>>>>>>>>>>
>>>>>>>> > > >>>>>>>>>>> Thanks,
>>>>>>>> > > >>>>>>>>>>>
>>>>>>>> > > >>>>>>>>>>> Jiangjie (Becket) Qin
>>>>>>>> > > >>>>>>>>>>>
>>>>>>>> > > >>>>>>>>>>> On Tue, May 28, 2019 at 6:24 PM Piotr Nowojski <
>>>>>>>> > > >>>>> pi...@ververica.com>
>>>>>>>> > > >>>>>>>>>>> wrote:
>>>>>>>> > > >>>>>>>>>>>
>>>>>>>> > > >>>>>>>>>>>> Hi,
>>>>>>>> > > >>>>>>>>>>>>
>>>>>>>> > > >>>>>>>>>>>> Thanks for the proposal and driving the effort
>>>>>>>> here Becket
>>>>>>>> > :)
>>>>>>>> > > >> I’ve
>>>>>>>> > > >>>>>>> read
>>>>>>>> > > >>>>>>>>>>>> through the FLIP-33 [1], and here are couple of my
>>>>>>>> thoughts.
>>>>>>>> > > >>>>>>>>>>>>
>>>>>>>> > > >>>>>>>>>>>> Big +1 for standardising the metric names between
>>>>>>>> > connectors,
>>>>>>>> > > it
>>>>>>>> > > >>>>> will
>>>>>>>> > > >>>>>>>>>>>> definitely help us and users a lot.
>>>>>>>> > > >>>>>>>>>>>>
>>>>>>>> > > >>>>>>>>>>>> Issues/questions/things to discuss that I’ve
>>>>>>>> thought of:
>>>>>>>> > > >>>>>>>>>>>>
>>>>>>>> > > >>>>>>>>>>>> 1a. If we are about to duplicate some metrics, can
>>>>>>>> this
>>>>>>>> > > become a
>>>>>>>> > > >>>>>>>>>>>> performance issue?
>>>>>>>> > > >>>>>>>>>>>> 1b. Generally speaking, we should make sure that
>>>>>>>> collecting
>>>>>>>> > > >> those
>>>>>>>> > > >>>>>>>>> metrics
>>>>>>>> > > >>>>>>>>>>>> is as non intrusive as possible, especially that
>>>>>>>> they will
>>>>>>>> > > need
>>>>>>>> > > >>>>> to be
>>>>>>>> > > >>>>>>>>>>>> updated once per record. (They might be collected
>>>>>>>> more
>>>>>>>> > rarely
>>>>>>>> > > >> with
>>>>>>>> > > >>>>>>> some
>>>>>>>> > > >>>>>>>>>>>> overhead, but the hot path of updating it per
>>>>>>>> record will
>>>>>>>> > need
>>>>>>>> > > >> to
>>>>>>>> > > >>>>> be
>>>>>>>> > > >>>>>>> as
>>>>>>>> > > >>>>>>>>>>>> quick as possible). That includes both avoiding
>>>>>>>> heavy
>>>>>>>> > > >> computation
>>>>>>>> > > >>>>> on
>>>>>>>> > > >>>>>>>>> per
>>>>>>>> > > >>>>>>>>>>>> record path: histograms?, measuring time for time
>>>>>>>> based
>>>>>>>> > > metrics
>>>>>>>> > > >>>>> (per
>>>>>>>> > > >>>>>>>>>>>> second) (System.currentTimeMillis() depending on
>>>>>>>> the
>>>>>>>> > > >>>>> implementation
>>>>>>>> > > >>>>>>> can
>>>>>>>> > > >>>>>>>>>>>> invoke a system call)
>>>>>>>> > > >>>>>>>>>>>>
>>>>>>>> > > >>>>>>>>>>>> 2. It would be nice to have metrics, that allow us
>>>>>>>> to check
>>>>>>>> > > the
>>>>>>>> > > >>>>> cause
>>>>>>>> > > >>>>>>>>> of
>>>>>>>> > > >>>>>>>>>>>> back pressure:
>>>>>>>> > > >>>>>>>>>>>> a) for sources, length of input queue (in bytes?
>>>>>>>> Or boolean
>>>>>>>> > > >>>>>>>>>>>> hasSomethingl/isEmpty)
>>>>>>>> > > >>>>>>>>>>>> b) for sinks, length of output queue (in bytes? Or
>>>>>>>> boolean
>>>>>>>> > > >>>>>>>>>>>> hasSomething/isEmpty)
>>>>>>>> > > >>>>>>>>>>>>
>>>>>>>> > > >>>>>>>>>>>> a) is useful in a scenario when we are processing
>>>>>>>> backlog of
>>>>>>>> > > >>>>> records,
>>>>>>>> > > >>>>>>>>> all
>>>>>>>> > > >>>>>>>>>>>> of the internal Flink’s input/output network
>>>>>>>> buffers are
>>>>>>>> > > empty,
>>>>>>>> > > >>>>> and
>>>>>>>> > > >>>>>>> we
>>>>>>>> > > >>>>>>>>> want
>>>>>>>> > > >>>>>>>>>>>> to check whether the external source system is the
>>>>>>>> > bottleneck
>>>>>>>> > > >>>>>>> (source’s
>>>>>>>> > > >>>>>>>>>>>> input queue will be empty), or if the Flink’s
>>>>>>>> connector is
>>>>>>>> > the
>>>>>>>> > > >>>>>>>>> bottleneck
>>>>>>>> > > >>>>>>>>>>>> (source’s input queues will be full).
>>>>>>>> > > >>>>>>>>>>>> b) similar story. Backlog of records, but this
>>>>>>>> time all of
>>>>>>>> > the
>>>>>>>> > > >>>>>>> internal
>>>>>>>> > > >>>>>>>>>>>> Flink’s input/output network buffers are full, and
>>>>>>>> we want to
>>>>>>>> > > >> check
>>>>>>>> > > >>>>>>>>> whether
>>>>>>>> > > >>>>>>>>>>>> the external sink system is the bottleneck (sink
>>>>>>>> output
>>>>>>>> > queues
>>>>>>>> > > >> are
>>>>>>>> > > >>>>>>>>> full),
>>>>>>>> > > >>>>>>>>>>>> or if the Flink’s connector is the bottleneck
>>>>>>>> (sink’s output
>>>>>>>> > > >>>>> queues
>>>>>>>> > > >>>>>>>>> will be
>>>>>>>> > > >>>>>>>>>>>> empty)
>>>>>>>> > > >>>>>>>>>>>>
>>>>>>>> > > >>>>>>>>>>>> It might be sometimes difficult to provide those
>>>>>>>> metrics, so
>>>>>>>> > > >> they
>>>>>>>> > > >>>>>>> could
>>>>>>>> > > >>>>>>>>>>>> be optional, but if we could provide them, it
>>>>>>>> would be
>>>>>>>> > really
>>>>>>>> > > >>>>>>> helpful.
>>>>>>>> > > >>>>>>>>>>>>
>>>>>>>> > > >>>>>>>>>>>> Piotrek
>>>>>>>> > > >>>>>>>>>>>>
>>>>>>>> > > >>>>>>>>>>>> [1]
>>>>>>>> > > >>>>>>>>>>>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-33:+Standardize+Connector+Metrics
>>>>>>>> > > >>>>>>>>>>>>
>>>>>>>> > > >>>>>>>>>>>>> On 24 Apr 2019, at 13:28, Stephan Ewen <
>>>>>>>> se...@apache.org>
>>>>>>>> > > >> wrote:
>>>>>>>> > > >>>>>>>>>>>>>
>>>>>>>> > > >>>>>>>>>>>>> I think this sounds reasonable.
>>>>>>>> > > >>>>>>>>>>>>>
>>>>>>>> > > >>>>>>>>>>>>> Let's keep the "reconfiguration without stopping
>>>>>>>> the job"
>>>>>>>> > out
>>>>>>>> > > >> of
>>>>>>>> > > >>>>>>> this,
>>>>>>>> > > >>>>>>>>>>>>> because that would be a super big effort and if
>>>>>>>> we approach
>>>>>>>> > > >> that,
>>>>>>>> > > >>>>>>> then
>>>>>>>> > > >>>>>>>>>>>> in
>>>>>>>> > > >>>>>>>>>>>>> more generic way rather than specific to
>>>>>>>> connector metrics.
>>>>>>>> > > >>>>>>>>>>>>>
>>>>>>>> > > >>>>>>>>>>>>> I would suggest to look at the following things
>>>>>>>> before
>>>>>>>> > > starting
>>>>>>>> > > >>>>> with
>>>>>>>> > > >>>>>>>>> any
>>>>>>>> > > >>>>>>>>>>>>> implementation work:
>>>>>>>> > > >>>>>>>>>>>>>
>>>>>>>> > > >>>>>>>>>>>>> - Try and find a committer to support this,
>>>>>>>> otherwise it
>>>>>>>> > will
>>>>>>>> > > >> be
>>>>>>>> > > >>>>>>> hard
>>>>>>>> > > >>>>>>>>>>>> to
>>>>>>>> > > >>>>>>>>>>>>> make progress
>>>>>>>> > > >>>>>>>>>>>>> - Start with defining a smaller set of "core
>>>>>>>> metrics" and
>>>>>>>> > > >> extend
>>>>>>>> > > >>>>> the
>>>>>>>> > > >>>>>>>>>>>> set
>>>>>>>> > > >>>>>>>>>>>>> later. I think that is easier than now blocking
>>>>>>>> on reaching
>>>>>>>> > > >>>>>>> consensus
>>>>>>>> > > >>>>>>>>>>>> on a
>>>>>>>> > > >>>>>>>>>>>>> large group of metrics.
>>>>>>>> > > >>>>>>>>>>>>> - Find a solution to the problem Chesnay
>>>>>>>> mentioned, that
>>>>>>>> > the
>>>>>>>> > > >>>>>>> "records
>>>>>>>> > > >>>>>>>>>>>> in"
>>>>>>>> > > >>>>>>>>>>>>> metric is somehow overloaded and exists already
>>>>>>>> in the IO
>>>>>>>> > > >> Metric
>>>>>>>> > > >>>>>>>>> group.
>>>>>>>> > > >>>>>>>>>>>>>
>>>>>>>> > > >>>>>>>>>>>>>
>>>>>>>> > > >>>>>>>>>>>>> On Mon, Mar 25, 2019 at 7:16 AM Becket Qin <
>>>>>>>> > > >> becket....@gmail.com
>>>>>>>> > > >>>>>>
>>>>>>>> > > >>>>>>>>>>>> wrote:
>>>>>>>> > > >>>>>>>>>>>>>
>>>>>>>> > > >>>>>>>>>>>>>> Hi Stephan,
>>>>>>>> > > >>>>>>>>>>>>>>
>>>>>>>> > > >>>>>>>>>>>>>> Thanks a lot for the feedback. All makes sense.
>>>>>>>> > > >>>>>>>>>>>>>>
>>>>>>>> > > >>>>>>>>>>>>>> It is a good suggestion to simply have an
>>>>>>>> > onRecord(numBytes,
>>>>>>>> > > >>>>>>>>> eventTime)
>>>>>>>> > > >>>>>>>>>>>>>> method for connector writers. It should meet
>>>>>>>> most of the
>>>>>>>> > > >>>>>>>>> requirements.
>>>>>>>> > > >>>>>>>>>>>>>>
>>>>>>>> > > >>>>>>>>>>>>>> The configurable metrics feature is something
>>>>>>>> really
>>>>>>>> > useful,
>>>>>>>> > > >>>>>>>>>>>> especially if
>>>>>>>> > > >>>>>>>>>>>>>> we can somehow make it dynamically configurable
>>>>>>>> without
>>>>>>>> > > >> stopping
>>>>>>>> > > >>>>>>> the
>>>>>>>> > > >>>>>>>>>>>> jobs.
>>>>>>>> > > >>>>>>>>>>>>>> It might be better to make it a separate
>>>>>>>> discussion
>>>>>>>> > because
>>>>>>>> > > it
>>>>>>>> > > >>>>> is a
>>>>>>>> > > >>>>>>>>>>>> more
>>>>>>>> > > >>>>>>>>>>>>>> generic feature instead of only for connectors.
>>>>>>>> > > >>>>>>>>>>>>>>
>>>>>>>> > > >>>>>>>>>>>>>> So in order to make some progress, in this FLIP
>>>>>>>> we can
>>>>>>>> > limit
>>>>>>>> > > >> the
>>>>>>>> > > >>>>>>>>>>>> discussion
>>>>>>>> > > >>>>>>>>>>>>>> scope to the connector related items:
>>>>>>>> > > >>>>>>>>>>>>>>
>>>>>>>> > > >>>>>>>>>>>>>> - the standard connector metric names and types.
>>>>>>>> > > >>>>>>>>>>>>>> - the abstract ConnectorMetricHandler interface
>>>>>>>> > > >>>>>>>>>>>>>>
>>>>>>>> > > >>>>>>>>>>>>>> I'll start a separate thread to discuss other general
>>>>>>>> > > >>>>>>>>>>>>>> metric-related enhancement items, including:
>>>>>>>> > > >>>>>>>>>>>>>>
>>>>>>>> > > >>>>>>>>>>>>>> - optional metrics
>>>>>>>> > > >>>>>>>>>>>>>> - dynamic metric configuration
>>>>>>>> > > >>>>>>>>>>>>>> - potential combination with rate limiter
>>>>>>>> > > >>>>>>>>>>>>>>
>>>>>>>> > > >>>>>>>>>>>>>> Does this plan sound reasonable?
>>>>>>>> > > >>>>>>>>>>>>>>
>>>>>>>> > > >>>>>>>>>>>>>> Thanks,
>>>>>>>> > > >>>>>>>>>>>>>>
>>>>>>>> > > >>>>>>>>>>>>>> Jiangjie (Becket) Qin
>>>>>>>> > > >>>>>>>>>>>>>>
>>>>>>>> > > >>>>>>>>>>>>>> On Sat, Mar 23, 2019 at 5:53 AM Stephan Ewen <se...@apache.org> wrote:
>>>>>>>> > > >>>>>>>>>>>>>>
>>>>>>>> > > >>>>>>>>>>>>>>> Ignoring for a moment implementation details, this
>>>>>>>> > > >>>>>>>>>>>>>>> connector metrics work is a really good thing to do,
>>>>>>>> > > >>>>>>>>>>>>>>> in my opinion.
>>>>>>>> > > >>>>>>>>>>>>>>>
>>>>>>>> > > >>>>>>>>>>>>>>> The question "oh, my job seems to be doing nothing,
>>>>>>>> > > >>>>>>>>>>>>>>> I am looking at the UI and the 'records in' value is
>>>>>>>> > > >>>>>>>>>>>>>>> still zero" is in the top three support questions I
>>>>>>>> > > >>>>>>>>>>>>>>> have been asked personally.
>>>>>>>> > > >>>>>>>>>>>>>>> Introspection into "how far is the consumer lagging
>>>>>>>> > > >>>>>>>>>>>>>>> behind" (event time fetch latency) came up many times
>>>>>>>> > > >>>>>>>>>>>>>>> as well.
>>>>>>>> > > >>>>>>>>>>>>>>>
>>>>>>>> > > >>>>>>>>>>>>>>> So big +1 to solving this problem.
>>>>>>>> > > >>>>>>>>>>>>>>>
>>>>>>>> > > >>>>>>>>>>>>>>> About the exact design - I would try to go for the
>>>>>>>> > > >>>>>>>>>>>>>>> following properties:
>>>>>>>> > > >>>>>>>>>>>>>>>
>>>>>>>> > > >>>>>>>>>>>>>>> - keep complexity out of connectors. Ideally the
>>>>>>>> > > >>>>>>>>>>>>>>> metrics handler has a single onRecord(numBytes,
>>>>>>>> > > >>>>>>>>>>>>>>> eventTime) method or so, and everything else is
>>>>>>>> > > >>>>>>>>>>>>>>> internal to the handler. That makes it dead simple
>>>>>>>> > > >>>>>>>>>>>>>>> for the connector (a rough sketch of this idea
>>>>>>>> > > >>>>>>>>>>>>>>> follows below). We can also think of an extension
>>>>>>>> > > >>>>>>>>>>>>>>> scheme for connector-specific metrics.
>>>>>>>> > > >>>>>>>>>>>>>>> - make it configurable on the job or cluster level
>>>>>>>> > > >>>>>>>>>>>>>>> which metrics the handler internally creates when
>>>>>>>> > > >>>>>>>>>>>>>>> that method is invoked.
>>>>>>>> > > >>>>>>>>>>>>>>>
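>>>>>>>> > > >>>>>>>>>>>>>>> Just to make the shape concrete, a minimal sketch,
>>>>>>>> > > >>>>>>>>>>>>>>> with purely hypothetical names (not an agreed API):
>>>>>>>> > > >>>>>>>>>>>>>>>
>>>>>>>> > > >>>>>>>>>>>>>>> import org.apache.flink.metrics.Counter;
>>>>>>>> > > >>>>>>>>>>>>>>> import org.apache.flink.metrics.Gauge;
>>>>>>>> > > >>>>>>>>>>>>>>> import org.apache.flink.metrics.MetricGroup;
>>>>>>>> > > >>>>>>>>>>>>>>>
>>>>>>>> > > >>>>>>>>>>>>>>> public interface ConnectorMetricHandler {
>>>>>>>> > > >>>>>>>>>>>>>>>     // The single entry point the connector calls;
>>>>>>>> > > >>>>>>>>>>>>>>>     // everything else stays internal to the handler.
>>>>>>>> > > >>>>>>>>>>>>>>>     void onRecord(long numBytes, long eventTime);
>>>>>>>> > > >>>>>>>>>>>>>>> }
>>>>>>>> > > >>>>>>>>>>>>>>>
>>>>>>>> > > >>>>>>>>>>>>>>> class SimpleMetricHandler implements ConnectorMetricHandler {
>>>>>>>> > > >>>>>>>>>>>>>>>     private final Counter numRecordsIn;
>>>>>>>> > > >>>>>>>>>>>>>>>     private final Counter numBytesIn;
>>>>>>>> > > >>>>>>>>>>>>>>>     private volatile long lastEventTime;
>>>>>>>> > > >>>>>>>>>>>>>>>
>>>>>>>> > > >>>>>>>>>>>>>>>     SimpleMetricHandler(MetricGroup group) {
>>>>>>>> > > >>>>>>>>>>>>>>>         this.numRecordsIn = group.counter("numRecordsIn");
>>>>>>>> > > >>>>>>>>>>>>>>>         this.numBytesIn = group.counter("numBytesIn");
>>>>>>>> > > >>>>>>>>>>>>>>>         group.gauge("lastEventTime", (Gauge<Long>) () -> lastEventTime);
>>>>>>>> > > >>>>>>>>>>>>>>>     }
>>>>>>>> > > >>>>>>>>>>>>>>>
>>>>>>>> > > >>>>>>>>>>>>>>>     @Override
>>>>>>>> > > >>>>>>>>>>>>>>>     public void onRecord(long numBytes, long eventTime) {
>>>>>>>> > > >>>>>>>>>>>>>>>         numRecordsIn.inc();
>>>>>>>> > > >>>>>>>>>>>>>>>         numBytesIn.inc(numBytes);
>>>>>>>> > > >>>>>>>>>>>>>>>         lastEventTime = eventTime;
>>>>>>>> > > >>>>>>>>>>>>>>>     }
>>>>>>>> > > >>>>>>>>>>>>>>> }
>>>>>>>> > > >>>>>>>>>>>>>>>
>>>>>>>> > > >>>>>>>>>>>>>>> The connector then just calls onRecord(bytes, timestamp)
>>>>>>>> > > >>>>>>>>>>>>>>> per record and stays oblivious to which metrics are
>>>>>>>> > > >>>>>>>>>>>>>>> actually maintained.
>>>>>>>> > > >>>>>>>>>>>>>>>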
>>>>>>>> > > >>>>>>>>>>>>>>> What do you think?
>>>>>>>> > > >>>>>>>>>>>>>>>
>>>>>>>> > > >>>>>>>>>>>>>>> Best,
>>>>>>>> > > >>>>>>>>>>>>>>> Stephan
>>>>>>>> > > >>>>>>>>>>>>>>>
>>>>>>>> > > >>>>>>>>>>>>>>>
>>>>>>>> > > >>>>>>>>>>>>>>> On Thu, Mar 21, 2019 at 10:42 AM Chesnay Schepler <ches...@apache.org> wrote:
>>>>>>>> > > >>>>>>>>>>>>>>>
>>>>>>>> > > >>>>>>>>>>>>>>>> As I said before, I believe this to be
>>>>>>>> > > >>>>>>>>>>>>>>>> over-engineered and have no interest in this
>>>>>>>> > > >>>>>>>>>>>>>>>> implementation.
>>>>>>>> > > >>>>>>>>>>>>>>>>
>>>>>>>> > > >>>>>>>>>>>>>>>> There are conceptual issues like defining a
>>>>>>>> > > >>>>>>>>>>>>>>>> duplicate numBytesIn(PerSec) metric that already
>>>>>>>> > > >>>>>>>>>>>>>>>> exists for each operator.
>>>>>>>> > > >>>>>>>>>>>>>>>>
>>>>>>>> > > >>>>>>>>>>>>>>>> On 21.03.2019 06:13, Becket Qin wrote:
>>>>>>>> > > >>>>>>>>>>>>>>>>> A few updates to the thread. I uploaded a patch[1]
>>>>>>>> > > >>>>>>>>>>>>>>>>> as a complete example of how users can use the
>>>>>>>> > > >>>>>>>>>>>>>>>>> metrics in the future.
>>>>>>>> > > >>>>>>>>>>>>>>>>>
>>>>>>>> > > >>>>>>>>>>>>>>>>> Some thoughts below after taking a look at the
>>>>>>>> > > >>>>>>>>>>>>>>>>> AbstractMetricGroup and its subclasses.
>>>>>>>> > > >>>>>>>>>>>>>>>>>
>>>>>>>> > > >>>>>>>>>>>>>>>>> This patch intends to provide convenience for
>>>>>>>> > > >>>>>>>>>>>>>>>>> Flink connector implementations to follow the
>>>>>>>> > > >>>>>>>>>>>>>>>>> metrics standards proposed in FLIP-33. It also
>>>>>>>> > > >>>>>>>>>>>>>>>>> tries to enhance metric management in a general
>>>>>>>> > > >>>>>>>>>>>>>>>>> way to help users with:
>>>>>>>> > > >>>>>>>>>>>>>>>>>
>>>>>>>> > > >>>>>>>>>>>>>>>>> 1. metric definition
>>>>>>>> > > >>>>>>>>>>>>>>>>> 2. metric dependencies check
>>>>>>>> > > >>>>>>>>>>>>>>>>> 3. metric validation
>>>>>>>> > > >>>>>>>>>>>>>>>>> 4. metric control (turn on / off particular metrics)
>>>>>>>> > > >>>>>>>>>>>>>>>>>
>>>>>>>> > > >>>>>>>>>>>>>>>>> This patch wraps |MetricGroup| to extend the
>>>>>>>> > > >>>>>>>>>>>>>>>>> functionality of |AbstractMetricGroup| and its
>>>>>>>> > > >>>>>>>>>>>>>>>>> subclasses. The |AbstractMetricGroup| mainly
>>>>>>>> > > >>>>>>>>>>>>>>>>> focuses on the metric group hierarchy, but does
>>>>>>>> > > >>>>>>>>>>>>>>>>> not really manage the metrics other than keeping
>>>>>>>> > > >>>>>>>>>>>>>>>>> them in a Map.
>>>>>>>> > > >>>>>>>>>>>>>>>>>
>>>>>>>> > > >>>>>>>>>>>>>>>>> Ideally we should only have one entry point for
>>>>>>>> > > >>>>>>>>>>>>>>>>> the metrics.
>>>>>>>> > > >>>>>>>>>>>>>>>>>
>>>>>>>> > > >>>>>>>>>>>>>>>>> Right now the entry point is |AbstractMetricGroup|.
>>>>>>>> > > >>>>>>>>>>>>>>>>> However, besides the missing functionality
>>>>>>>> > > >>>>>>>>>>>>>>>>> mentioned above, |AbstractMetricGroup| seems deeply
>>>>>>>> > > >>>>>>>>>>>>>>>>> rooted in the Flink runtime. We could extract it
>>>>>>>> > > >>>>>>>>>>>>>>>>> out to flink-metrics in order to use it for generic
>>>>>>>> > > >>>>>>>>>>>>>>>>> purposes. There will be some work, though.
>>>>>>>> > > >>>>>>>>>>>>>>>>>
>>>>>>>> > > >>>>>>>>>>>>>>>>> Another approach is to make |AbstractMetrics| in
>>>>>>>> > > >>>>>>>>>>>>>>>>> this patch the metric entry point. It wraps the
>>>>>>>> > > >>>>>>>>>>>>>>>>> metric group and provides the missing
>>>>>>>> > > >>>>>>>>>>>>>>>>> functionalities. Then we can roll out this pattern
>>>>>>>> > > >>>>>>>>>>>>>>>>> to runtime components gradually as well.
>>>>>>>> > > >>>>>>>>>>>>>>>>>
>>>>>>>> > > >>>>>>>>>>>>>>>>> My first thought is that the latter approach gives
>>>>>>>> > > >>>>>>>>>>>>>>>>> a smoother migration. But I am also OK with doing a
>>>>>>>> > > >>>>>>>>>>>>>>>>> refactoring on the |AbstractMetricGroup| family.
>>>>>>>> > > >>>>>>>>>>>>>>>>>
>>>>>>>> > > >>>>>>>>>>>>>>>>>
>>>>>>>> > > >>>>>>>>>>>>>>>>> Thanks,
>>>>>>>> > > >>>>>>>>>>>>>>>>>
>>>>>>>> > > >>>>>>>>>>>>>>>>> Jiangjie (Becket) Qin
>>>>>>>> > > >>>>>>>>>>>>>>>>>
>>>>>>>> > > >>>>>>>>>>>>>>>>> [1] https://github.com/becketqin/flink/pull/1
>>>>>>>> > > >>>>>>>>>>>>>>>>>
>>>>>>>> > > >>>>>>>>>>>>>>>>> On Mon, Feb 25, 2019 at 2:32 PM Becket Qin <becket....@gmail.com> wrote:
>>>>>>>> > > >>>>>>>>>>>>>>>>>
>>>>>>>> > > >>>>>>>>>>>>>>>>> Hi Chesnay,
>>>>>>>> > > >>>>>>>>>>>>>>>>>
>>>>>>>> > > >>>>>>>>>>>>>>>>> It might be easier to discuss some implementation
>>>>>>>> > > >>>>>>>>>>>>>>>>> details in the PR review instead of in the FLIP
>>>>>>>> > > >>>>>>>>>>>>>>>>> discussion thread. I have a patch for the Kafka
>>>>>>>> > > >>>>>>>>>>>>>>>>> connectors ready but haven't submitted the PR yet.
>>>>>>>> > > >>>>>>>>>>>>>>>>> Hopefully that will help explain a bit more.
>>>>>>>> > > >>>>>>>>>>>>>>>>>
>>>>>>>> > > >>>>>>>>>>>>>>>>> ** Re: metric type binding
>>>>>>>> > > >>>>>>>>>>>>>>>>> This is a valid point that is worth discussing. If
>>>>>>>> > > >>>>>>>>>>>>>>>>> I understand correctly, there are two points:
>>>>>>>> > > >>>>>>>>>>>>>>>>> 1. Metric type / interface does not matter as
>>>>>>>> > > >>>>>>>>>>>>>>>>> 1. Metric type / interface does not matter as long
>>>>>>>> > > >>>>>>>>>>>>>>>>> as the metric semantic is clearly defined.
>>>>>>>> > > >>>>>>>>>>>>>>>>> Conceptually speaking, I agree that as long as the
>>>>>>>> > > >>>>>>>>>>>>>>>>> metric semantic is defined, the metric type does
>>>>>>>> > > >>>>>>>>>>>>>>>>> not matter. To some extent, Gauge / Counter /
>>>>>>>> > > >>>>>>>>>>>>>>>>> Meter / Histogram themselves can be thought of as
>>>>>>>> > > >>>>>>>>>>>>>>>>> some well-recognized semantics, if you wish. In
>>>>>>>> > > >>>>>>>>>>>>>>>>> Flink, these metric semantics have their associated
>>>>>>>> > > >>>>>>>>>>>>>>>>> interface classes. In practice, such
>>>>>>>> > > >>>>>>>>>>>>>>>>> semantic-to-interface binding seems necessary for
>>>>>>>> > > >>>>>>>>>>>>>>>>> different components to communicate. Simply
>>>>>>>> > > >>>>>>>>>>>>>>>>> standardizing the semantics of the connector
>>>>>>>> > > >>>>>>>>>>>>>>>>> metrics seems insufficient for people to build an
>>>>>>>> > > >>>>>>>>>>>>>>>>> ecosystem on top of. At the end of the day, we
>>>>>>>> > > >>>>>>>>>>>>>>>>> still need to have some embodiment of the metric
>>>>>>>> > > >>>>>>>>>>>>>>>>> semantics that people can program against.
>>>>>>>> > > >>>>>>>>>>>>>>>>>
>>>>>>>> > > >>>>>>>>>>>>>>>>> 2. Sometimes the same metric semantic can be
>>>>>>>> > > >>>>>>>>>>>>>>>>> exposed using different metric types / interfaces.
>>>>>>>> > > >>>>>>>>>>>>>>>>> This is a good point. Counter and Gauge-as-a-Counter
>>>>>>>> > > >>>>>>>>>>>>>>>>> are pretty much interchangeable. This is more of a
>>>>>>>> > > >>>>>>>>>>>>>>>>> trade-off between the user experience of metric
>>>>>>>> > > >>>>>>>>>>>>>>>>> producers and consumers. The metric producers want
>>>>>>>> > > >>>>>>>>>>>>>>>>> to use a Counter or a Gauge depending on whether
>>>>>>>> > > >>>>>>>>>>>>>>>>> the count is already tracked in code, while ideally
>>>>>>>> > > >>>>>>>>>>>>>>>>> the metric consumers only want to see a single
>>>>>>>> > > >>>>>>>>>>>>>>>>> metric type for each metric. I am leaning towards
>>>>>>>> > > >>>>>>>>>>>>>>>>> making the metric producers happy, i.e. allowing
>>>>>>>> > > >>>>>>>>>>>>>>>>> the Gauge / Counter metric types, and having the
>>>>>>>> > > >>>>>>>>>>>>>>>>> metric consumers handle the type variation. The
>>>>>>>> > > >>>>>>>>>>>>>>>>> reason is that in practice, there might be more
>>>>>>>> > > >>>>>>>>>>>>>>>>> connector implementations than metric reporter
>>>>>>>> > > >>>>>>>>>>>>>>>>> implementations. We could also provide some helper
>>>>>>>> > > >>>>>>>>>>>>>>>>> method to facilitate reading from such a variable
>>>>>>>> > > >>>>>>>>>>>>>>>>> metric type.
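>>>>>>>> > > >>>>>>>>>>>>>>>>> For example, a rough sketch of such a helper
>>>>>>>> > > >>>>>>>>>>>>>>>>> (hypothetical, not an existing Flink API):
>>>>>>>> > > >>>>>>>>>>>>>>>>>
>>>>>>>> > > >>>>>>>>>>>>>>>>> import org.apache.flink.metrics.Counter;
>>>>>>>> > > >>>>>>>>>>>>>>>>> import org.apache.flink.metrics.Gauge;
>>>>>>>> > > >>>>>>>>>>>>>>>>> import org.apache.flink.metrics.Metric;
>>>>>>>> > > >>>>>>>>>>>>>>>>>
>>>>>>>> > > >>>>>>>>>>>>>>>>> public final class MetricReaders {
>>>>>>>> > > >>>>>>>>>>>>>>>>>     private MetricReaders() {}
>>>>>>>> > > >>>>>>>>>>>>>>>>>
>>>>>>>> > > >>>>>>>>>>>>>>>>>     // Reads a count regardless of whether the
>>>>>>>> > > >>>>>>>>>>>>>>>>>     // producer chose a Counter or a Gauge.
>>>>>>>> > > >>>>>>>>>>>>>>>>>     public static long readCount(Metric metric) {
>>>>>>>> > > >>>>>>>>>>>>>>>>>         if (metric instanceof Counter) {
>>>>>>>> > > >>>>>>>>>>>>>>>>>             return ((Counter) metric).getCount();
>>>>>>>> > > >>>>>>>>>>>>>>>>>         }
>>>>>>>> > > >>>>>>>>>>>>>>>>>         if (metric instanceof Gauge) {
>>>>>>>> > > >>>>>>>>>>>>>>>>>             Object value = ((Gauge<?>) metric).getValue();
>>>>>>>> > > >>>>>>>>>>>>>>>>>             if (value instanceof Number) {
>>>>>>>> > > >>>>>>>>>>>>>>>>>                 return ((Number) value).longValue();
>>>>>>>> > > >>>>>>>>>>>>>>>>>             }
>>>>>>>> > > >>>>>>>>>>>>>>>>>         }
>>>>>>>> > > >>>>>>>>>>>>>>>>>         throw new IllegalArgumentException(
>>>>>>>> > > >>>>>>>>>>>>>>>>>                 "Metric does not expose a numeric count");
>>>>>>>> > > >>>>>>>>>>>>>>>>>     }
>>>>>>>> > > >>>>>>>>>>>>>>>>> }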
>>>>>>>> > > >>>>>>>>>>>>>>>>>
>>>>>>>> > > >>>>>>>>>>>>>>>>>
>>>>>>>> > > >>>>>>>>>>>>>>>>> Just some quick replies to the comments around
>>>>>>>> > > >>>>>>>>>>>>>>>>> implementation details.
>>>>>>>> > > >>>>>>>>>>>>>>>>>
>>>>>>>> > > >>>>>>>>>>>>>>>>>   4) single place where metrics are registered
>>>>>>>> > > >>>>>>>>>>>>>>>>>   except connector-specific ones (which we can't
>>>>>>>> > > >>>>>>>>>>>>>>>>>   really avoid).
>>>>>>>> > > >>>>>>>>>>>>>>>>>
>>>>>>>> > > >>>>>>>>>>>>>>>>> Registering connector-specific ones in a single
>>>>>>>> > > >>>>>>>>>>>>>>>>> place is actually something that I want to achieve.
>>>>>>>> > > >>>>>>>>>>>>>>>>>
>>>>>>>> > > >>>>>>>>>>>>>>>>>   2) I'm talking about time-series databases like
>>>>>>>> > > >>>>>>>>>>>>>>>>>   Prometheus. We would only have a gauge metric
>>>>>>>> > > >>>>>>>>>>>>>>>>>   exposing the last fetchTime/emitTime that is
>>>>>>>> > > >>>>>>>>>>>>>>>>>   regularly reported to the backend (Prometheus),
>>>>>>>> > > >>>>>>>>>>>>>>>>>   where a user could build a histogram of his
>>>>>>>> > > >>>>>>>>>>>>>>>>>   choosing when/if he wants it.
>>>>>>>> > > >>>>>>>>>>>>>>>>>
>>>>>>>> > > >>>>>>>>>>>>>>>>> Not sure if such downsampling works. As an example,
>>>>>>>> > > >>>>>>>>>>>>>>>>> if a user complains that there are some intermittent
>>>>>>>> > > >>>>>>>>>>>>>>>>> latency spikes (maybe a few records in 10 seconds)
>>>>>>>> > > >>>>>>>>>>>>>>>>> in their processing system, having a Gauge sampling
>>>>>>>> > > >>>>>>>>>>>>>>>>> the instantaneous latency is unlikely to be useful.
>>>>>>>> > > >>>>>>>>>>>>>>>>> However, looking at the actual 99.9th percentile
>>>>>>>> > > >>>>>>>>>>>>>>>>> latency might help.
>>>>>>>> > > >>>>>>>>>>>>>>>>>
>>>>>>>> > > >>>>>>>>>>>>>>>>> Thanks,
>>>>>>>> > > >>>>>>>>>>>>>>>>>
>>>>>>>> > > >>>>>>>>>>>>>>>>> Jiangjie (Becket) Qin
>>>>>>>> > > >>>>>>>>>>>>>>>>>
>>>>>>>> > > >>>>>>>>>>>>>>>>>
>>>>>>>> > > >>>>>>>>>>>>>>>>> On Fri, Feb 22, 2019 at 9:30 PM Chesnay Schepler <ches...@apache.org> wrote:
>>>>>>>> > > >>>>>>>>>>>>>>>>>
>>>>>>>> > > >>>>>>>>>>>>>>>>>   Re: over complication of implementation.
>>>>>>>> > > >>>>>>>>>>>>>>>>>
>>>>>>>> > > >>>>>>>>>>>>>>>>>   I think I understand better now what you're
>>>>>>>> > > >>>>>>>>>>>>>>>>>   shooting for, effectively something like the
>>>>>>>> > > >>>>>>>>>>>>>>>>>   OperatorIOMetricGroup.
>>>>>>>> > > >>>>>>>>>>>>>>>>>   But still, re-define setupConnectorMetrics() to
>>>>>>>> > > >>>>>>>>>>>>>>>>>   accept a set of flags for counters/meters (and
>>>>>>>> > > >>>>>>>>>>>>>>>>>   _possibly_ histograms) along with a set of
>>>>>>>> > > >>>>>>>>>>>>>>>>>   well-defined Optional<Gauge<?>>, and return the
>>>>>>>> > > >>>>>>>>>>>>>>>>>   group.
>>>>>>>> > > >>>>>>>>>>>>>>>>>
>>>>>>>> > > >>>>>>>>>>>>>>>>>   Solves all issues as far as I can tell:
>>>>>>>> > > >>>>>>>>>>>>>>>>>   1) no metrics must be created manually (except
>>>>>>>> > > >>>>>>>>>>>>>>>>>   Gauges, which are effectively just Suppliers and
>>>>>>>> > > >>>>>>>>>>>>>>>>>   you can't get around this),
>>>>>>>> > > >>>>>>>>>>>>>>>>>   2) additional metrics can be registered on the
>>>>>>>> > > >>>>>>>>>>>>>>>>>   returned group,
>>>>>>>> > > >>>>>>>>>>>>>>>>>   3) see 1),
>>>>>>>> > > >>>>>>>>>>>>>>>>>   4) single place where metrics are registered
>>>>>>>> > > >>>>>>>>>>>>>>>>>   except connector-specific ones (which we can't
>>>>>>>> > > >>>>>>>>>>>>>>>>>   really avoid).
>>>>>>>> > > >>>>>>>>>>>>>>>>>
>>>>>>>> > > >>>>>>>>>>>>>>>>>   Re: Histogram
>>>>>>>> > > >>>>>>>>>>>>>>>>>
>>>>>>>> > > >>>>>>>>>>>>>>>>>   1) As an example, whether "numRecordsIn" is
>>>>>>>> > > >>>>>>>>>>>>>>>>>   exposed as a Counter or a Gauge should be
>>>>>>>> > > >>>>>>>>>>>>>>>>>   irrelevant. So far we're using the metric type
>>>>>>>> > > >>>>>>>>>>>>>>>>>   that is the most convenient at exposing a given
>>>>>>>> > > >>>>>>>>>>>>>>>>>   value. If there is some backing data-structure
>>>>>>>> > > >>>>>>>>>>>>>>>>>   that we want to expose some data from, we
>>>>>>>> > > >>>>>>>>>>>>>>>>>   typically opt for a Gauge, as otherwise we're
>>>>>>>> > > >>>>>>>>>>>>>>>>>   just mucking around with the Meter/Counter API to
>>>>>>>> > > >>>>>>>>>>>>>>>>>   get it to match. Similarly, if we want to count
>>>>>>>> > > >>>>>>>>>>>>>>>>>   something but no current count exists, we
>>>>>>>> > > >>>>>>>>>>>>>>>>>   typically add a Counter. That's why attaching
>>>>>>>> > > >>>>>>>>>>>>>>>>>   semantics to metric types makes little sense
>>>>>>>> > > >>>>>>>>>>>>>>>>>   (but unfortunately several reporters already do
>>>>>>>> > > >>>>>>>>>>>>>>>>>   it); for counters/meters certainly, but the
>>>>>>>> > > >>>>>>>>>>>>>>>>>   majority of metrics are gauges.
>>>>>>>> > > >>>>>>>>>>>>>>>>>
>>>>>>>> > > >>>>>>>>>>>>>>>>>   2) I'm talking about time-series databases like
>>>>>>>> > > >>>>>>>>>>>>>>>>>   Prometheus. We would only have a gauge metric
>>>>>>>> > > >>>>>>>>>>>>>>>>>   exposing the last fetchTime/emitTime that is
>>>>>>>> > > >>>>>>>>>>>>>>>>>   regularly reported to the backend (Prometheus),
>>>>>>>> > > >>>>>>>>>>>>>>>>>   where a user could build a histogram of his
>>>>>>>> > > >>>>>>>>>>>>>>>>>   choosing when/if he wants it.
>>>>>>>> > > >>>>>>>>>>>>>>>>>
>>>>>>>> > > >>>>>>>>>>>>>>>>>   On 22.02.2019 13:57, Becket Qin wrote:
>>>>>>>> > > >>>>>>>>>>>>>>>>>> Hi Chesnay,
>>>>>>>> > > >>>>>>>>>>>>>>>>>>
>>>>>>>> > > >>>>>>>>>>>>>>>>>> Thanks for the explanation.
>>>>>>>> > > >>>>>>>>>>>>>>>>>>
>>>>>>>> > > >>>>>>>>>>>>>>>>>> ** Re: FLIP
>>>>>>>> > > >>>>>>>>>>>>>>>>>> I might have misunderstood this, but it seems that
>>>>>>>> > > >>>>>>>>>>>>>>>>>> "major changes" are well defined in the FLIP. The
>>>>>>>> > > >>>>>>>>>>>>>>>>>> full content is the following:
>>>>>>>> > > >>>>>>>>>>>>>>>>>> What is considered a "major change" that needs a
>>>>>>>> > > >>>>>>>>>>>>>>>>>> FLIP?
>>>>>>>> > > >>>>>>>>>>>>>>>>>>
>>>>>>>> > > >>>>>>>>>>>>>>>>>> Any of the following should be considered a major
>>>>>>>> > > >>>>>>>>>>>>>>>>>> change:
>>>>>>>> > > >>>>>>>>>>>>>>>>>>
>>>>>>>> > > >>>>>>>>>>>>>>>>>> - Any major new feature, subsystem, or piece of
>>>>>>>> > > >>>>>>>>>>>>>>>>>> functionality
>>>>>>>> > > >>>>>>>>>>>>>>>>>> - *Any change that impacts the public interfaces
>>>>>>>> > > >>>>>>>>>>>>>>>>>> of the project*
>>>>>>>> > > >>>>>>>>>>>>>>>>>>
>>>>>>>> > > >>>>>>>>>>>>>>>>>> What are the "public interfaces" of the project?
>>>>>>>> > > >>>>>>>>>>>>>>>>>>
>>>>>>>> > > >>>>>>>>>>>>>>>>>>
>>>>>>>> > > >>>>>>>>>>>>>>>>>>
>>>>>>>> > > >>>>>>>>>>>>>>>>>> *All of the following are public interfaces* that
>>>>>>>> > > >>>>>>>>>>>>>>>>>> people build around:
>>>>>>>> > > >>>>>>>>>>>>>>>>>>
>>>>>>>> > > >>>>>>>>>>>>>>>>>> - DataStream and DataSet API, including classes
>>>>>>>> > > >>>>>>>>>>>>>>>>>> related to that, such as StreamExecutionEnvironment
>>>>>>>> > > >>>>>>>>>>>>>>>>>>
>>>>>>>> > > >>>>>>>>>>>>>>>>>>
>>>>>>>> > > >>>>>>>>>>>>>>>>>> - Classes marked with the @Public annotation
>>>>>>>> > > >>>>>>>>>>>>>>>>>>
>>>>>>>> > > >>>>>>>>>>>>>>>>>>
>>>>>>>> > > >>>>>>>>>>>>>>>>>> - On-disk binary formats, such as
>>>>>>>> > > >>>>>>>>>>>>>>>>>> checkpoints/savepoints
>>>>>>>> > > >>>>>>>>>>>>>>>>>>
>>>>>>>> > > >>>>>>>>>>>>>>>>>>
>>>>>>>> > > >>>>>>>>>>>>>>>>>> - User-facing scripts/command-line tools, i.e.
>>>>>>>> > > >>>>>>>>>>>>>>>>>> bin/flink, Yarn scripts, Mesos scripts
>>>>>>>> > > >>>>>>>>>>>>>>>>>>
>>>>>>>> > > >>>>>>>>>>>>>>>>>>
>>>>>>>> > > >>>>>>>>>>>>>>>>>> - Configuration settings
>>>>>>>> > > >>>>>>>>>>>>>>>>>>
>>>>>>>> > > >>>>>>>>>>>>>>>>>>
>>>>>>>> > > >>>>>>>>>>>>>>>>>> - *Exposed monitoring information*
>>>>>>>> > > >>>>>>>>>>>>>>>>>>
>>>>>>>> > > >>>>>>>>>>>>>>>>>>
>>>>>>>> > > >>>>>>>>>>>>>>>>>> So any monitoring information change is considered
>>>>>>>> > > >>>>>>>>>>>>>>>>>> a public interface change, and any public interface
>>>>>>>> > > >>>>>>>>>>>>>>>>>> change is considered a "major change".
>>>>>>>> > > >>>>>>>>>>>>>>>>>>
>>>>>>>> > > >>>>>>>>>>>>>>>>>>
>>>>>>>> > > >>>>>>>>>>>>>>>>>> ** Re: over complication of implementation.
>>>>>>>> > > >>>>>>>>>>>>>>>>>>
>>>>>>>> > > >>>>>>>>>>>>>>>>>> Although this is more of an implementation detail
>>>>>>>> > > >>>>>>>>>>>>>>>>>> that is not covered by the FLIP, it may be worth
>>>>>>>> > > >>>>>>>>>>>>>>>>>> discussing.
>>>>>>>> > > >>>>>>>>>>>>>>>>>>
>>>>>>>> > > >>>>>>>>>>>>>>>>>> First of all, I completely agree that we should
>>>>>>>> > > >>>>>>>>>>>>>>>>>> use the simplest way to achieve our goal. To me
>>>>>>>> > > >>>>>>>>>>>>>>>>>> the goal is the following:
>>>>>>>> > > >>>>>>>>>>>>>>>>>> 1. Clear connector conventions and interfaces.
>>>>>>>> > > >>>>>>>>>>>>>>>>>> 2. The easiness of creating a connector.
>>>>>>>> > > >>>>>>>>>>>>>>>>>>
>>>>>>>> > > >>>>>>>>>>>>>>>>>> Both of them are important to the prosperity of
>>>>>>>> > > >>>>>>>>>>>>>>>>>> the connector ecosystem. So I'd rather abstract as
>>>>>>>> > > >>>>>>>>>>>>>>>>>> much as possible on our side to make the connector
>>>>>>>> > > >>>>>>>>>>>>>>>>>> developer's work lighter. Given this goal, a
>>>>>>>> > > >>>>>>>>>>>>>>>>>> static util method approach might have a few
>>>>>>>> > > >>>>>>>>>>>>>>>>>> drawbacks:
>>>>>>>> > > >>>>>>>>>>>>>>>>>> 1. Users still have to construct the metrics by
>>>>>>>> > > >>>>>>>>>>>>>>>>>> themselves. (And note that this might be
>>>>>>>> > > >>>>>>>>>>>>>>>>>> error-prone by itself. For example, a custom
>>>>>>>> > > >>>>>>>>>>>>>>>>>> wrapper around a Dropwizard meter may be used
>>>>>>>> > > >>>>>>>>>>>>>>>>>> instead of MeterView.)
>>>>>>>> > > >>>>>>>>>>>>>>>>>> 2. When connector-specific metrics are added, it
>>>>>>>> > > >>>>>>>>>>>>>>>>>> is difficult to enforce the scope to be the same
>>>>>>>> > > >>>>>>>>>>>>>>>>>> as for the standard metrics.
>>>>>>>> > > >>>>>>>>>>>>>>>>>> 3. It seems that method proliferation is
>>>>>>>> > > >>>>>>>>>>>>>>>>>> inevitable if we want to apply sanity checks, e.g.
>>>>>>>> > > >>>>>>>>>>>>>>>>>> a check that the numBytesIn metric was registered
>>>>>>>> > > >>>>>>>>>>>>>>>>>> as a meter.
>>>>>>>> > > >>>>>>>>>>>>>>>>>> 4. Metrics are still defined in random places and
>>>>>>>> > > >>>>>>>>>>>>>>>>>> are hard to track.
>>>>>>>> > > >>>>>>>>>>>>>>>>>>
>>>>>>>> > > >>>>>>>>>>>>>>>>>> The current PR I had was inspired by the Config
>>>>>>>> > > >>>>>>>>>>>>>>>>>> system in Kafka, which I found pretty handy. In
>>>>>>>> > > >>>>>>>>>>>>>>>>>> fact it is used not only by Kafka itself but even
>>>>>>>> > > >>>>>>>>>>>>>>>>>> by some other projects that depend on Kafka. I am
>>>>>>>> > > >>>>>>>>>>>>>>>>>> not saying this approach is perfect. But I think
>>>>>>>> > > >>>>>>>>>>>>>>>>>> it is worth saving the work for connector writers
>>>>>>>> > > >>>>>>>>>>>>>>>>>> and encouraging a more systematic implementation.
>>>>>>>> > > >>>>>>>>>>>>>>>>>> That being said, I am fully open to suggestions.
>>>>>>>> > > >>>>>>>>>>>>>>>>>>
>>>>>>>> > > >>>>>>>>>>>>>>>>>>
>>>>>>>> > > >>>>>>>>>>>>>>>>>> Re: Histogram
>>>>>>>> > > >>>>>>>>>>>>>>>>>> I think there are two orthogonal questions around
>>>>>>>> > > >>>>>>>>>>>>>>>>>> those metrics:
>>>>>>>> > > >>>>>>>>>>>>>>>>>>
>>>>>>>> > > >>>>>>>>>>>>>>>>>> 1. Regardless of the metric type, by just looking
>>>>>>>> > > >>>>>>>>>>>>>>>>>> at the meaning of a metric, is it generic to all
>>>>>>>> > > >>>>>>>>>>>>>>>>>> connectors? If the answer is yes, we should include
>>>>>>>> > > >>>>>>>>>>>>>>>>>> the metric in the convention. No matter whether we
>>>>>>>> > > >>>>>>>>>>>>>>>>>> include it in the convention or not, some connector
>>>>>>>> > > >>>>>>>>>>>>>>>>>> implementations will emit such a metric. It is
>>>>>>>> > > >>>>>>>>>>>>>>>>>> better to have a convention than letting each
>>>>>>>> > > >>>>>>>>>>>>>>>>>> connector do random things.
>>>>>>>> > > >>>>>>>>>>>>>>>>>>
>>>>>>>> > > >>>>>>>>>>>>>>>>>> 2. If a standard metric is a histogram, what
>>>>>>>> > > >>>>>>>>>>>>>>>>>> should we do?
>>>>>>>> > > >>>>>>>>>>>>>>>>>> I agree that we should make it clear that using
>>>>>>>> > > >>>>>>>>>>>>>>>>>> histograms carries a performance risk. But I do
>>>>>>>> > > >>>>>>>>>>>>>>>>>> see that histograms are useful in some
>>>>>>>> > > >>>>>>>>>>>>>>>>>> fine-granularity debugging where one does not have
>>>>>>>> > > >>>>>>>>>>>>>>>>>> the luxury of stopping the system and injecting
>>>>>>>> > > >>>>>>>>>>>>>>>>>> more inspection code. So the workaround I am
>>>>>>>> > > >>>>>>>>>>>>>>>>>> thinking of is to provide some implementation
>>>>>>>> > > >>>>>>>>>>>>>>>>>> suggestions. Assume later on we have a mechanism
>>>>>>>> > > >>>>>>>>>>>>>>>>>> of selective metrics. In the abstract metrics
>>>>>>>> > > >>>>>>>>>>>>>>>>>> class we can disable those metrics by default, so
>>>>>>>> > > >>>>>>>>>>>>>>>>>> individual connector writers do not have to do
>>>>>>>> > > >>>>>>>>>>>>>>>>>> anything (this is another advantage of having an
>>>>>>>> > > >>>>>>>>>>>>>>>>>> AbstractMetrics instead of static util methods).
>>>>>>>> > > >>>>>>>>>>>>>>>>>>
>>>>>>>> > > >>>>>>>>>>>>>>>>>> I am not sure I fully understand the
>>>>>>>> > > >>>>>>>>>>>>>>>>>> histogram-in-the-backend approach. Can you explain
>>>>>>>> > > >>>>>>>>>>>>>>>>>> a bit more? Do you mean emitting the raw data,
>>>>>>>> > > >>>>>>>>>>>>>>>>>> e.g. fetchTime and emitTime, with each record and
>>>>>>>> > > >>>>>>>>>>>>>>>>>> letting the histogram computation happen in the
>>>>>>>> > > >>>>>>>>>>>>>>>>>> background? Or having the processing thread put
>>>>>>>> > > >>>>>>>>>>>>>>>>>> the values into a queue, with a separate thread
>>>>>>>> > > >>>>>>>>>>>>>>>>>> polling from the queue and adding them into the
>>>>>>>> > > >>>>>>>>>>>>>>>>>> histogram?
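>>>>>>>> > > >>>>>>>>>>>>>>>>>> To make sure we are talking about the same thing,
>>>>>>>> > > >>>>>>>>>>>>>>>>>> here is a rough sketch of the second variant as I
>>>>>>>> > > >>>>>>>>>>>>>>>>>> understand it (purely illustrative, using a
>>>>>>>> > > >>>>>>>>>>>>>>>>>> Dropwizard-style histogram):
>>>>>>>> > > >>>>>>>>>>>>>>>>>>
>>>>>>>> > > >>>>>>>>>>>>>>>>>> import com.codahale.metrics.Histogram;
>>>>>>>> > > >>>>>>>>>>>>>>>>>> import com.codahale.metrics.SlidingWindowReservoir;
>>>>>>>> > > >>>>>>>>>>>>>>>>>> import java.util.concurrent.ArrayBlockingQueue;
>>>>>>>> > > >>>>>>>>>>>>>>>>>> import java.util.concurrent.BlockingQueue;
>>>>>>>> > > >>>>>>>>>>>>>>>>>>
>>>>>>>> > > >>>>>>>>>>>>>>>>>> public class BackgroundHistogram {
>>>>>>>> > > >>>>>>>>>>>>>>>>>>     private final BlockingQueue<Long> samples =
>>>>>>>> > > >>>>>>>>>>>>>>>>>>             new ArrayBlockingQueue<>(10_000);
>>>>>>>> > > >>>>>>>>>>>>>>>>>>     private final Histogram histogram =
>>>>>>>> > > >>>>>>>>>>>>>>>>>>             new Histogram(new SlidingWindowReservoir(1024));
>>>>>>>> > > >>>>>>>>>>>>>>>>>>
>>>>>>>> > > >>>>>>>>>>>>>>>>>>     // Called on the hot path: a cheap,
>>>>>>>> > > >>>>>>>>>>>>>>>>>>     // non-blocking hand-off (drops samples when full).
>>>>>>>> > > >>>>>>>>>>>>>>>>>>     public void record(long value) {
>>>>>>>> > > >>>>>>>>>>>>>>>>>>         samples.offer(value);
>>>>>>>> > > >>>>>>>>>>>>>>>>>>     }
>>>>>>>> > > >>>>>>>>>>>>>>>>>>
>>>>>>>> > > >>>>>>>>>>>>>>>>>>     // A daemon thread pays the histogram update
>>>>>>>> > > >>>>>>>>>>>>>>>>>>     // cost off the processing path.
>>>>>>>> > > >>>>>>>>>>>>>>>>>>     public void start() {
>>>>>>>> > > >>>>>>>>>>>>>>>>>>         Thread updater = new Thread(() -> {
>>>>>>>> > > >>>>>>>>>>>>>>>>>>             try {
>>>>>>>> > > >>>>>>>>>>>>>>>>>>                 while (true) {
>>>>>>>> > > >>>>>>>>>>>>>>>>>>                     histogram.update(samples.take());
>>>>>>>> > > >>>>>>>>>>>>>>>>>>                 }
>>>>>>>> > > >>>>>>>>>>>>>>>>>>             } catch (InterruptedException e) {
>>>>>>>> > > >>>>>>>>>>>>>>>>>>                 Thread.currentThread().interrupt();
>>>>>>>> > > >>>>>>>>>>>>>>>>>>             }
>>>>>>>> > > >>>>>>>>>>>>>>>>>>         });
>>>>>>>> > > >>>>>>>>>>>>>>>>>>         updater.setDaemon(true);
>>>>>>>> > > >>>>>>>>>>>>>>>>>>         updater.start();
>>>>>>>> > > >>>>>>>>>>>>>>>>>>     }
>>>>>>>> > > >>>>>>>>>>>>>>>>>> }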
>>>>>>>> > > >>>>>>>>>>>>>>>>>>
>>>>>>>> > > >>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>> > > >>>>>>>>>>>>>>>>>>
>>>>>>>> > > >>>>>>>>>>>>>>>>>> Jiangjie (Becket) Qin
>>>>>>>> > > >>>>>>>>>>>>>>>>>>
>>>>>>>> > > >>>>>>>>>>>>>>>>>>
>>>>>>>> > > >>>>>>>>>>>>>>>>>>
>>>>>>>> > > >>>>>>>>>>>>>>>>>>
>>>>>>>> > > >>>>>>>>>>>>>>>>>>
>>>>>>>> > > >>>>>>>>>>>>>>>>>> On Fri, Feb 22, 2019 at 4:34 PM Chesnay Schepler <ches...@apache.org> wrote:
>>>>>>>> > > >>>>>>>>>>>>>>>>>>
>>>>>>>> > > >>>>>>>>>>>>>>>>>>> Re: Flip
>>>>>>>> > > >>>>>>>>>>>>>>>>>>> The very first line under both the main header
>>>>>>>> > > >>>>>>>>>>>>>>>>>>> and the Purpose section describes FLIPs as "major
>>>>>>>> > > >>>>>>>>>>>>>>>>>>> changes", which this isn't.
>>>>>>>> > > >>>>>>>>>>>>>>>>>>>
>>>>>>>> > > >>>>>>>>>>>>>>>>>>> Re: complication
>>>>>>>> > > >>>>>>>>>>>>>>>>>>> I'm not arguing against standardization, but
>>>>>>>> > > >>>>>>>>>>>>>>>>>>> again against an over-complicated implementation
>>>>>>>> > > >>>>>>>>>>>>>>>>>>> when a static utility method would be sufficient
>>>>>>>> > > >>>>>>>>>>>>>>>>>>> (see the usage sketch after the list below):
>>>>>>>> > > >>>>>>>>>>>>>>>>>>>
>>>>>>>> > > >>>>>>>>>>>>>>>>>>> public static void setupConnectorMetrics(
>>>>>>>> > > >>>>>>>>>>>>>>>>>>>         MetricGroup operatorMetricGroup,
>>>>>>>> > > >>>>>>>>>>>>>>>>>>>         String connectorName,
>>>>>>>> > > >>>>>>>>>>>>>>>>>>>         Optional<Gauge<Long>> numRecordsIn,
>>>>>>>> > > >>>>>>>>>>>>>>>>>>>         ...)
>>>>>>>> > > >>>>>>>>>>>>>>>>>>>
>>>>>>>> > > >>>>>>>>>>>>>>>>>>> This gives you all you need:
>>>>>>>> > > >>>>>>>>>>>>>>>>>>> * a well-defined set of metrics for a connector
>>>>>>>> > > >>>>>>>>>>>>>>>>>>> to opt-in
>>>>>>>> > > >>>>>>>>>>>>>>>>>>> * standardized naming schemes for scope and
>>>>>>>> > > >>>>>>>>>>>>>>>>>>> individual metrics
>>>>>>>> > > >>>>>>>>>>>>>>>>>>> * standardized metric types (although personally
>>>>>>>> > > >>>>>>>>>>>>>>>>>>> I'm not interested in that since metric types
>>>>>>>> > > >>>>>>>>>>>>>>>>>>> should be considered syntactic sugar)
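>>>>>>>> > > >>>>>>>>>>>>>>>>>>>
>>>>>>>> > > >>>>>>>>>>>>>>>>>>> A hypothetical call site, just to show the shape
>>>>>>>> > > >>>>>>>>>>>>>>>>>>> (the field and the trailing Optional parameter
>>>>>>>> > > >>>>>>>>>>>>>>>>>>> are made up for illustration):
>>>>>>>> > > >>>>>>>>>>>>>>>>>>>
>>>>>>>> > > >>>>>>>>>>>>>>>>>>> // In a connector's setup path; recordsConsumed
>>>>>>>> > > >>>>>>>>>>>>>>>>>>> // is an existing count field of the connector.
>>>>>>>> > > >>>>>>>>>>>>>>>>>>> Gauge<Long> recordsIn = () -> this.recordsConsumed;
>>>>>>>> > > >>>>>>>>>>>>>>>>>>> setupConnectorMetrics(
>>>>>>>> > > >>>>>>>>>>>>>>>>>>>         getRuntimeContext().getMetricGroup(),
>>>>>>>> > > >>>>>>>>>>>>>>>>>>>         "MyConnector",
>>>>>>>> > > >>>>>>>>>>>>>>>>>>>         Optional.of(recordsIn),
>>>>>>>> > > >>>>>>>>>>>>>>>>>>>         Optional.empty());  // e.g. no byte count available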
>>>>>>>> > > >>>>>>>>>>>>>>>>>>>
>>>>>>>> > > >>>>>>>>>>>>>>>>>>> Re: Configurable Histogram
>>>>>>>> > > >>>>>>>>>>>>>>>>>>> If anything they _must_ be turned off by default,
>>>>>>>> > > >>>>>>>>>>>>>>>>>>> but the metric system is already exposing so many
>>>>>>>> > > >>>>>>>>>>>>>>>>>>> options that I'm not too keen on adding even more.
>>>>>>>> > > >>>>>>>>>>>>>>>>>>> You have also only addressed my first argument
>>>>>>>> > > >>>>>>>>>>>>>>>>>>> against histograms (performance); the second one
>>>>>>>> > > >>>>>>>>>>>>>>>>>>> still stands (calculate histograms in metric
>>>>>>>> > > >>>>>>>>>>>>>>>>>>> backends instead).
>>>>>>>> > > >>>>>>>>>>>>>>>>>>>
>>>>>>>> > > >>>>>>>>>>>>>>>>>>> On 21.02.2019 16:27, Becket Qin wrote:
>>>>>>>> > > >>>>>>>>>>>>>>>>>>>> Hi Chesnay,
>>>>>>>> > > >>>>>>>>>>>>>>>>>>>>
>>>>>>>> > > >>>>>>>>>>>>>>>>>>>> Thanks for the comments. I think this is worthy
>>>>>>>> > > >>>>>>>>>>>>>>>>>>>> of a FLIP because it is public API. According to
>>>>>>>> > > >>>>>>>>>>>>>>>>>>>> the FLIP description, a FLIP is required in case
>>>>>>>> > > >>>>>>>>>>>>>>>>>>>> of:
>>>>>>>> > > >>>>>>>>>>>>>>>>>>>> - Any change that impacts the public interfaces
>>>>>>>> > > >>>>>>>>>>>>>>>>>>>> of the project
>>>>>>>> > > >>>>>>>>>>>>>>>>>>>>
>>>>>>>> > > >>>>>>>>>>>>>>>>>>>> and the following entry is found in the
>>>>>>>> > > >>>>>>>>>>>>>>>>>>>> definition of "public interface":
>>>>>>>> > > >>>>>>>>>>>>>>>>>>>>
>>>>>>>> > > >>>>>>>>>>>>>>>>>>>> - Exposed monitoring information
>>>>>>>> > > >>>>>>>>>>>>>>>>>>>>
>>>>>>>> > > >>>>>>>>>>>>>>>>>>>> Metrics are critical to any production system.
>>>>>>>> > > >>>>>>>>>>>>>>>>>>>> So a clear metric definition is important for
>>>>>>>> > > >>>>>>>>>>>>>>>>>>>> any serious user. For an organization with a
>>>>>>>> > > >>>>>>>>>>>>>>>>>>>> large Flink installation, a change in metrics
>>>>>>>> > > >>>>>>>>>>>>>>>>>>>> means a great amount of work. So such changes do
>>>>>>>> > > >>>>>>>>>>>>>>>>>>>> need to be fully discussed and documented.
>>>>>>>> > > >>>>>>>>>>>>>>>>>>>>
>>>>>>>> > > >>>>>>>>>>>>>>>>>>>> ** Re: Histogram.
>>>>>>>> > > >>>>>>>>>>>>>>>>>>>> We can discuss whether there is a better way to
>>>>>>>> > > >>>>>>>>>>>>>>>>>>>> expose metrics that are suitable for histograms.
>>>>>>>> > > >>>>>>>>>>>>>>>>>>>> My micro-benchmark on various histogram
>>>>>>>> > > >>>>>>>>>>>>>>>>>>>> implementations also indicates that they are
>>>>>>>> > > >>>>>>>>>>>>>>>>>>>> significantly slower than other metric types.
>>>>>>>> > > >>>>>>>>>>>>>>>>>>>> But I don't think that means we should never use
>>>>>>>> > > >>>>>>>>>>>>>>>>>>>> histograms; it means we should use them with
>>>>>>>> > > >>>>>>>>>>>>>>>>>>>> caution. For example, we can suggest that
>>>>>>>> > > >>>>>>>>>>>>>>>>>>>> implementations turn them off by default and
>>>>>>>> > > >>>>>>>>>>>>>>>>>>>> only turn them on for a small amount of time
>>>>>>>> > > >>>>>>>>>>>>>>>>>>>> when performing some micro-debugging.
>>>>>>>> > > >>>>>>>>>>>>>>>>>>>>
>>>>>>>> > > >>>>>>>>>>>>>>>>>>>> ** Re: complication:
>>>>>>>> > > >>>>>>>>>>>>>>>>>>>> Connector conventions are essential for the
>>>>>>>> > > >>>>>>>>>>>>>>>>>>>> Flink ecosystem. The Flink connector pool is
>>>>>>>> > > >>>>>>>>>>>>>>>>>>>> probably the most important part of Flink, just
>>>>>>>> > > >>>>>>>>>>>>>>>>>>>> like in any other data system. Clear connector
>>>>>>>> > > >>>>>>>>>>>>>>>>>>>> conventions will help build the Flink ecosystem
>>>>>>>> > > >>>>>>>>>>>>>>>>>>>> in a more organic way.
>>>>>>>> > > >>>>>>>>>>>>>>>>>>>> Take the metrics convention as an example:
>>>>>>>> > > >>>>>>>>>>>>>>>>>>>> imagine someone has developed a Flink connector
>>>>>>>> > > >>>>>>>>>>>>>>>>>>>> for system foo, and another developer may have
>>>>>>>> > > >>>>>>>>>>>>>>>>>>>> developed a monitoring and diagnostic framework
>>>>>>>> > > >>>>>>>>>>>>>>>>>>>> for Flink which analyzes Flink job performance
>>>>>>>> > > >>>>>>>>>>>>>>>>>>>> based on metrics. With a clear metric convention,
>>>>>>>> > > >>>>>>>>>>>>>>>>>>>> those two projects could be developed
>>>>>>>> > > >>>>>>>>>>>>>>>>>>>> independently. Once users put them together,
>>>>>>>> > > >>>>>>>>>>>>>>>>>>>> they would work without additional modifications.
>>>>>>>> > > >>>>>>>>>>>>>>>>>>>> This cannot be easily achieved by just defining
>>>>>>>> > > >>>>>>>>>>>>>>>>>>>> a few constants.
>>>>>>>> > > >>>>>>>>>>>>>>>>>>>>
>>>>>>>> > > >>>>>>>>>>>>>>>>>>>> ** Re: selective metrics:
>>>>>>>> > > >>>>>>>>>>>>>>>>>>>> Sure, we can discuss that in a separate thread.
>>>>>>>> > > >>>>>>>>>>>>>>>>>>>>
>>>>>>>> > > >>>>>>>>>>>>>>>>>>>> @Dawid
>>>>>>>> > > >>>>>>>>>>>>>>>>>>>>
>>>>>>>> > > >>>>>>>>>>>>>>>>>>>> ** Re: latency / fetchedLatency
>>>>>>>> > > >>>>>>>>>>>>>>>>>>>> The primary purpose of establishing such a
>>>>>>>> > > >>>>>>>>>>>>>>>>>>>> convention is to help developers write connectors
>>>>>>>> > > >>>>>>>>>>>>>>>>>>>> in a more compatible way. The convention is
>>>>>>>> > > >>>>>>>>>>>>>>>>>>>> supposed to be defined more proactively. So when
>>>>>>>> > > >>>>>>>>>>>>>>>>>>>> looking at the convention, it seems more
>>>>>>>> > > >>>>>>>>>>>>>>>>>>>> important to see if the concept is applicable to
>>>>>>>> > > >>>>>>>>>>>>>>>>>>>> connectors in general. It might be true that so
>>>>>>>> > > >>>>>>>>>>>>>>>>>>>> far only the Kafka connector reports latency.
>>>>>>>> > > >>>>>>>>>>>>>>>>>>>> But there might be hundreds of other connector
>>>>>>>> > > >>>>>>>>>>>>>>>>>>>> implementations in the Flink ecosystem, though
>>>>>>>> > > >>>>>>>>>>>>>>>>>>>> not in the Flink repo, and some of them also
>>>>>>>> > > >>>>>>>>>>>>>>>>>>>> emit latency. I think a lot of other sources
>>>>>>>> > > >>>>>>>>>>>>>>>>>>>> actually also have an append timestamp, e.g.
>>>>>>>> > > >>>>>>>>>>>>>>>>>>>> database binlogs and some K-V stores. So I
>>>>>>>> > > >>>>>>>>>>>>>>>>>>>> wouldn't be surprised if some database connectors
>>>>>>>> > > >>>>>>>>>>>>>>>>>>>> could also emit latency metrics.
>>>>>>>> > > >>>>>>>>>>>>>>>>>>>>
>>>>>>>> > > >>>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>> > > >>>>>>>>>>>>>>>>>>>>
>>>>>>>> > > >>>>>>>>>>>>>>>>>>>> Jiangjie (Becket) Qin
>>>>>>>> > > >>>>>>>>>>>>>>>>>>>>
>>>>>>>> > > >>>>>>>>>>>>>>>>>>>>
>>>>>>>> > > >>>>>>>>>>>>>>>>>>>> On Thu, Feb 21, 2019 at 10:14 PM Chesnay Schepler <ches...@apache.org> wrote:
>>>>>>>> > > >>>>>>>>>>>>>>>>>>>>
>>>>>>>> > > >>>>>>>>>>>>>>>>>>>>> Regarding 2) It doesn't make sense to
>>>>>>>> > > >>>>>>>>>>>>>>>>>>>>> investigate this as part of this FLIP. This is
>>>>>>>> > > >>>>>>>>>>>>>>>>>>>>> something that could be of interest for the
>>>>>>>> > > >>>>>>>>>>>>>>>>>>>>> entire metric system, and should be designed as
>>>>>>>> > > >>>>>>>>>>>>>>>>>>>>> such.
>>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>
>>>>>>>> > > >>>>>>>>>>>>>>>>>>>>> Regarding the proposal as a whole:
>>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>
>>>>>>>> > > >>>>>>>>>>>>>>>>>>>>> Histogram metrics shall not be added to the
>>>>>>>> > > >>>>>>>>>>>>>>>>>>>>> core of Flink. They are significantly more
>>>>>>>> > > >>>>>>>>>>>>>>>>>>>>> expensive than other metrics, and calculating
>>>>>>>> > > >>>>>>>>>>>>>>>>>>>>> histograms in the application is regarded as an
>>>>>>>> > > >>>>>>>>>>>>>>>>>>>>> anti-pattern by several metric backends, which
>>>>>>>> > > >>>>>>>>>>>>>>>>>>>>> instead recommend exposing the raw data and
>>>>>>>> > > >>>>>>>>>>>>>>>>>>>>> calculating the histogram in the backend.
>>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>
>>>>>>>> > > >>>>>>>>>>>>>>>>>>>>> Second, this seems overly complicated. Given
>>>>>>>> > > >>>>>>>>>>>>>>>>>>>>> that we already established that not all
>>>>>>>> > > >>>>>>>>>>>>>>>>>>>>> connectors will export all metrics, we are
>>>>>>>> > > >>>>>>>>>>>>>>>>>>>>> effectively reducing this down to a consistent
>>>>>>>> > > >>>>>>>>>>>>>>>>>>>>> naming scheme. We don't need anything
>>>>>>>> > > >>>>>>>>>>>>>>>>>>>>> sophisticated for that; basically just a few
>>>>>>>> > > >>>>>>>>>>>>>>>>>>>>> constants that all connectors use.
>>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>
>>>>>>>> > > >>>>>>>>>>>>>>>>>>>>> I'm not convinced that this is worthy of a FLIP.
>>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>
>>>>>>>> > > >>>>>>>>>>>>>>>>>>>>> On 21.02.2019 14:26, Dawid Wysakowicz wrote:
>>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>> Hi,
>>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>>
>>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>> Ad 1. In general I understand and I agree. But
>>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>> those particular metrics (latency,
>>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>> fetchLatency) right now would only be reported
>>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>> if the user uses the KafkaConsumer with an
>>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>> internal timestampAssigner and
>>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>> StreamCharacteristic set to EventTime, right?
>>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>> That sounds like a very specific case. I am
>>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>> not sure if we should introduce a generic
>>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>> metric that will be disabled/absent for most
>>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>> implementations.
>>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>>
>>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>> Ad 2. That sounds like an orthogonal issue
>>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>> that might make sense to investigate in the
>>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>> future.
>>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>>
>>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>> Best,
>>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>>
>>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>> Dawid
>>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>>
>>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>> On 21/02/2019 13:20, Becket Qin wrote:
>>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>>> Hi Dawid,
>>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>>> Thanks for the feedback. That makes sense to
>>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>>> me. There are two cases to be addressed.
>>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>>> 1. The metrics are supposed to be a guidance.
>>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>>> It is likely that a connector only supports
>>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>>> some but not all of the metrics. In that
>>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>>> case, each connector implementation should
>>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>>> have the freedom to decide which metrics are
>>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>>> reported. For the metrics that are supported,
>>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>>> the guidance should be followed.
>>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>>> 2. Sometimes users may want to disable certain
>>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>>> metrics for some reason (e.g. performance /
>>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>>> reprocessing of data). A generic mechanism
>>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>>> should be provided to allow users to choose
>>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>>> which metrics are reported. This mechanism
>>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>>> should also be honored by the connector
>>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>>> implementations (a hypothetical sketch of such
>>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>>> a switch follows below).
>>>>>>>> > > >>>>>>>>>>>>>> implementations.
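>>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>>> As a sketch of what such a mechanism could
>>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>>> look like (the option name and API below are
>>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>>> made up for illustration; config and
>>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>>> metricGroup are assumed to be in scope):
>>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>>> // Hypothetical option, e.g.
>>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>>> // "metrics.connectors.disabled: fetchLatency,latency"
>>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>>> Set<String> disabled = new HashSet<>(Arrays.asList(
>>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>>>         config.getString("metrics.connectors.disabled", "").split(",")));
>>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>>> // Connectors consult the switch before
>>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>>> // registering an optional metric.
>>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>>> if (!disabled.contains("fetchLatency")) {
>>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>>>     metricGroup.gauge("fetchLatency",
>>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>>>             (Gauge<Long>) () -> currentFetchLatency);
>>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>>> }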
>>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>>> Does this sound reasonable to you?
>>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>>> Jiangjie (Becket) Qin
>>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>>> On Thu, Feb 21, 2019 at 4:22 PM Dawid Wysakowicz <dwysakow...@apache.org> wrote:
>>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>>>> Hi,
>>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>>>> Generally I like the idea of having a
>>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>>>> unified, standard set of metrics for all
>>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>>>> connectors. I have some slight concerns about
>>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>>>> fetchLatency and latency though. They are
>>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>>>> computed based on EventTime, which is not a
>>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>>>> purely technical feature. It often depends
>>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>>>> on some business logic, and might be absent
>>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>>>> or defined after the source. Those metrics
>>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>>>> could also behave in a weird way in case of
>>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>>>> replaying a backlog. Therefore I am not sure
>>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>>>> if we should include those metrics by
>>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>>>> default. Maybe we could at least introduce a
>>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>>>> feature switch for them? What do you think?
>>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>>>> Best,
>>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>>>> Dawid
>>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>>>> On 21/02/2019 03:13, Becket Qin wrote:
>>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>>>> Bump. If there are no objections to the
>>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>>>> proposed metrics, I'll start a voting thread
>>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>>>> later today.
>>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>>>> Jiangjie (Becket) Qin
>>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>>>> On Mon, Feb 11, 2019 at 8:17 PM Becket Qin <becket....@gmail.com> wrote:
>>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>>>> Hi folks,
>>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>>>> I would like to start the FLIP discussion
>>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>>>> thread about standardizing the connector
>>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>>>> metrics.
>>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>>>> In short, we would like to provide a
>>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>>>> convention for Flink connector metrics. It
>>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>>>> will help simplify monitoring and alerting
>>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>>>> on Flink jobs.
>>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>>>> The FLIP link is the following:
>>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>>>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-33%3A+Standardize+Connector+Metrics
>>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>>>> Jiangjie (Becket) Qin
>>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>> >
>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>>
>>>>>>>> Konstantin Knauf
>>>>>>>>
>>>>>>>> https://twitter.com/snntrable
>>>>>>>>
>>>>>>>> https://github.com/knaufk
>>>>>>>>