Having the watermark lag metric was the important part from my side. So +1 to go ahead.
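The three lag metrics settled on in this thread can be summarized in a small sketch (illustrative only, not Flink code; the class and method shapes are invented, and all timestamps are assumed to be wall-clock milliseconds):

```java
// Hypothetical sketch (not Flink API): how the three lags discussed in this
// thread relate. All lags are defined against the wall clock, in milliseconds.
public final class SourceLagSketch {

    // FetchTime - EventTime: how long the record spent outside Flink
    // before the source fetched it.
    public static long currentFetchEventTimeLag(long fetchTimeMs, long eventTimeMs) {
        return fetchTimeMs - eventTimeMs;
    }

    // EmitTime - EventTime: the lag observed at the moment the source
    // emits the record into the Flink pipeline.
    public static long currentEmitEventTimeLag(long emitTimeMs, long eventTimeMs) {
        return emitTimeMs - eventTimeMs;
    }

    // WallClock - CurrentWatermark: how far behind wall-clock time the
    // source's progress is overall; does not oscillate with out-of-order events.
    public static long watermarkLag(long wallClockMs, long currentWatermarkMs) {
        return wallClockMs - currentWatermarkMs;
    }

    public static void main(String[] args) {
        long event = 1_000, fetch = 4_000, emit = 5_000, now = 6_000, watermark = 900;
        System.out.println(currentFetchEventTimeLag(fetch, event)); // 3000
        System.out.println(currentEmitEventTimeLag(emit, event));   // 4000
        System.out.println(watermarkLag(now, watermark));           // 5100
    }
}
```

The two event-time lags move with the timestamps of individual (possibly out-of-order) records, while the watermark lag only moves as the watermark advances, which is the property Stephan points out above.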
On Fri, Sep 18, 2020 at 4:11 PM Becket Qin <becket....@gmail.com> wrote: > Hey Stephan, > > Thanks for the quick reply. I actually forgot to mention that I also added > a "watermarkLag" metric to the Source metrics in addition to the > "currentFetchEventTimeLag" and "currentEmitEventTimeLag". > So in the current proposal, both XXEventTimeLag will be based on event > timestamps, while the watermarkLag is based on watermark. And all the lags > are defined against the wall clock time. Does this cover what you are > proposing? > > Thanks, > > Jiangjie (Becket) Qin > > On Fri, Sep 18, 2020 at 6:08 PM Stephan Ewen <step...@ververica.com> > wrote: > >> Hi Becket! >> >> I am wondering if it makes sense to do the following small change: >> >> - Have "currentFetchEventTimeLag" be defined on event timestamps >> (optionally, if the source system exposes it) <== this is like in your >> proposal >> this helps understand how long the records were in the source before >> >> - BUT change "currentEmitEventTimeLag" to be >> "currentSourceWatermarkLag" instead. >> That way, users can see how far behind the wall-clock-time the source >> progress is all in all. >> It is also a more well defined metric as it does not oscillate with >> out-of-order events. >> >> What do you think? >> >> Best, >> Stephan >> >> >> >> On Fri, Sep 18, 2020 at 12:02 PM Becket Qin <becket....@gmail.com> wrote: >> >>> Hi folks, >>> >>> Thanks for all the great feedback. I have just updated FLIP-33 wiki with >>> the following changes: >>> >>> 1. Renaming. "currentFetchLatency" to "currentFetchEventTimeLag", >>> "currentLatency" to "currentEmitEventTimeLag". >>> 2. Added the public interface code change required for the new metrics. >>> 3. Added description of whether a metric is predefined or optional, and >>> which component is expected to update the metric. >>> >>> Please let me know if you have any questions. I'll start a vote in two >>> days if there are no further concerns. 
>>> >>> Thanks, >>> >>> Jiangjie (Becket) Qin >>> >>> On Wed, Sep 9, 2020 at 9:56 AM Becket Qin <becket....@gmail.com> wrote: >>> >>>> Hi Stephan, >>>> >>>> Thanks for the input. Just a few more clarifications / questions. >>>> >>>> *Num Bytes / Records Metrics* >>>> >>>> 1. At this point, the *numRecordsIn(Rate)* metrics exist in both >>>> OperatorIOMetricGroup and TaskIOMetricGroup. I did not find >>>> *numRecordsIn(Rate)* in the TaskIOMetricGroup updated anywhere other >>>> than in the unit tests. Am I missing something? >>>> >>>> 2. *numBytesIn(Rate)* metrics only exist in TaskIOMetricGroup. At this >>>> point, a SourceReader only has access to a SourceReaderContext which >>>> provides an OperatorMetricGroup. So it seems that connector developers >>>> are not able to update the *numBytesIn(Rate)*. With the multiple >>>> Source chaining support, it is possible that multiple Sources are >>>> in the same task. So it looks like we need to add *numBytesIn(Rate)* >>>> to the operator metrics as well. >>>> >>>> >>>> *Current (Fetch) Latency* >>>> >>>> *currentFetchLatency* helps clearly tell whether the latency is caused >>>> by Flink or not. Backpressure is not the only reason that we see fetch >>>> latency. Even if there is no backpressure, the records may have passed a >>>> long pipeline before they entered Flink. For example, say the >>>> *currentLatency* >>>> is 10 seconds and there is no backpressure. Does that mean the record >>>> spent 10 seconds in the Source operator? If not, how much did Flink >>>> contribute to that 10 seconds of latency? These questions are frequently >>>> asked and hard to answer without the fetch latency. >>>> >>>> For "currentFetchLatency", we would need to understand timestamps >>>>> before the records are decoded. That is only possible for some sources, >>>>> where the client gives us the records in a (partially) decoded form >>>>> already >>>>> (like Kafka).
Then, some work has been done between the fetch time and the >>>>> time we update the metric already, so it is already a bit closer to the >>>>> "currentFetchLatency". I think following this train of thought, there is >>>>> diminished benefit from that specific metric. >>>> >>>> >>>> We may not have to report the fetch latency before records are decoded. >>>> One solution is to remember the* FetchTime* when the encoded records >>>> are fetched, and report the fetch latency after the records are decoded by >>>> computing (*FetchTime - EventTime*). An approximate implementation >>>> would be adding a *FetchTime *field to the *RecordsWithSplitIds* assuming >>>> that all the records in that data structure are fetched at the same time. >>>> >>>> Thoughts? >>>> >>>> Thanks, >>>> >>>> Jiangjie (Becket) Qin >>>> >>>> On Wed, Sep 9, 2020 at 12:42 AM Stephan Ewen <step...@ververica.com> >>>> wrote: >>>> >>>>> Thanks for reviving this, Becket! >>>>> >>>>> I think Konstantin's comments are great. I'd add these points: >>>>> >>>>> *Num Bytes / Records Metrics* >>>>> >>>>> For "numBytesIn" and "numRecordsIn", we should reuse the >>>>> OperatorIOMetric group, then it also gets reported to the overview page in >>>>> the Web UI. >>>>> >>>>> The "numBytesInPerSecond" and "numRecordsInPerSecond" are >>>>> automatically derived metrics, no need to do anything once we populate the >>>>> above two metrics >>>>> >>>>> >>>>> *Current (Fetch) Latency* >>>>> >>>>> I would really go for "eventTimeLag" rather than "fetchLatency". I >>>>> think "eventTimeLag" is a term that has some adoption in the Flink >>>>> community and beyond. >>>>> >>>>> I am not so sure that I see the benefit between "currentLatency" and >>>>> "currentFetchLatency", (or event time lag before/after) as this only is >>>>> different by the time it takes to emit a batch. 
>>>>> - In a non-backpressured case, these should be virtually >>>>> identical (and both dominated by watermark lag, not the actual time it >>>>> takes the fetch to be emitted) >>>>> - In a backpressured case, why do you care about when data was >>>>> fetched, as opposed to emitted? Emitted time is relevant for application >>>>> semantics and checkpoints. Fetch time seems to be an implementation detail >>>>> (how much does the source buffer). >>>>> >>>>> The "currentLatency" (eventTimeLagAfter) can be computed >>>>> out-of-the-box, independent of a source implementation, so that is also a >>>>> good argument to make it the main metric. >>>>> We know timestamps and watermarks in the source. Except for cases >>>>> where no watermarks have been defined at all (batch jobs or pure >>>>> processing >>>>> time jobs), in which case this metric should probably be "Infinite". >>>>> >>>>> For "currentFetchLatency", we would need to understand timestamps >>>>> before the records are decoded. That is only possible for some sources, >>>>> where the client gives us the records in a (partially) decoded form >>>>> already >>>>> (like Kafka). Then, some work has been done between the fetch time and the >>>>> time we update the metric already, so it is already a bit closer to the >>>>> "currentFetchLatency". I think following this train of thought, there is >>>>> diminished benefit from that specific metric. >>>>> >>>>> >>>>> *Idle Time* >>>>> >>>>> I agree, it would be great to rename this. Maybe to "sourceWaitTime" >>>>> or "sourceIdleTime" to make it clear that this is not exactly the time >>>>> that >>>>> Flink's processing pipeline is idle, but the time when the source does >>>>> not >>>>> have new data. >>>>> >>>>> This is not an easy metric to collect, though (except maybe for the >>>>> sources that are only idle while they have no split assigned, like >>>>> continuous file source).
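The distinction between the "sourceIdleTime" discussed here and the existing idleTimeMsPerSecond (FLINK-16864, where a backpressured task also counts as idle) can be sketched as follows (a minimal illustration, not Flink's implementation; the class and method names are invented):

```java
// Hypothetical sketch (not Flink code): "sourceIdleTime" as the wall-clock
// time since the source last saw a record. By this definition, a source that
// is backpressured but still receiving data is NOT idle, unlike the idleness
// definition used by idleTimeMsPerSecond in FLINK-16864.
public final class IdleTimeSketch {

    private long lastRecordTimeMs;

    public IdleTimeSketch(long startTimeMs) {
        this.lastRecordTimeMs = startTimeMs;
    }

    // Called whenever the source sees a new record.
    public void onRecord(long nowMs) {
        lastRecordTimeMs = nowMs;
    }

    // Gauge value: how long the source has gone without new data.
    public long sourceIdleTime(long nowMs) {
        return nowMs - lastRecordTimeMs;
    }

    public static void main(String[] args) {
        IdleTimeSketch idle = new IdleTimeSketch(0);
        idle.onRecord(2_000);
        System.out.println(idle.sourceIdleTime(7_000)); // 5000
    }
}
```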
>>>>> >>>>> *Source Specific Metrics* >>>>> >>>>> I believe source-specific would only be "sourceIdleTime", >>>>> "numRecordsInErrors", "pendingBytes", and "pendingRecords". >>>>> >>>>> >>>>> *Conclusion* >>>>> >>>>> We can probably add "numBytesIn" and "numRecordsIn" and "eventTimeLag" >>>>> right away, with little complexity. >>>>> I'd suggest to start with these right away. >>>>> >>>>> Best, >>>>> Stephan >>>>> >>>>> >>>>> On Tue, Sep 8, 2020 at 3:25 PM Becket Qin <becket....@gmail.com> >>>>> wrote: >>>>> >>>>>> Hey Konstantin, >>>>>> >>>>>> Thanks for the feedback and suggestions. Please see the reply below. >>>>>> >>>>>> * idleTime: In the meantime, a similar metric "idleTimeMsPerSecond" >>>>>>> has >>>>>>> been introduced in https://issues.apache.org/jira/browse/FLINK-16864. >>>>>>> They >>>>>>> have a similar name, but different definitions of idleness, >>>>>>> e.g. "idleTimeMsPerSecond" considers the SourceTask idle, when it is >>>>>>> backpressured. Can we make it clearer that these two metrics mean >>>>>>> different >>>>>>> things? >>>>>> >>>>>> >>>>>> That is a good point. I did not notice this metric earlier. It seems >>>>>> that both metrics are useful to the users. One tells them how busy the >>>>>> source is and how much more throughput the source can handle. The other >>>>>> tells the users how long since the source has seen the last record, which >>>>>> is useful for debugging. I'll update the FLIP to make it clear. >>>>>> >>>>>> * "current(Fetch)Latency" I am wondering if >>>>>>> "eventTimeLag(Before|After)" >>>>>>> is more descriptive/clear. What do others think? >>>>>> >>>>>> >>>>>> I am quite open to the ideas on these names. In fact I also feel >>>>>> "current(Fetch)Latency" are not super intuitive. So it would be great if >>>>>> we >>>>>> can have better names. >>>>>> >>>>>> * Current(Fetch)Latency implies that the timestamps are directly >>>>>>> extracted in the source connector, right? 
Will this be the default >>>>>>> for >>>>>>> FLIP-27 sources anyway? >>>>>> >>>>>> >>>>>> The "currentFetchLatency" will probably be reported by each source >>>>>> implementation, because the data fetching is done by SplitReaders and >>>>>> there >>>>>> is no base implementation. The "currentLatency", on the other hand, can >>>>>> be >>>>>> reported by the SourceReader base implementation. However, since >>>>>> developers >>>>>> can actually implement their own source connector without using our base >>>>>> implementation, this metric guidance is still useful for connector >>>>>> developers in that case. >>>>>> >>>>>> * Does FLIP-33 also include the implementation of these metrics (to >>>>>>> the >>>>>>> extent possible) for all connectors currently available in Apache >>>>>>> Flink or >>>>>>> is the "per-connector implementation" out-of-scope? >>>>>> >>>>>> >>>>>> FLIP-33 itself does not specify any implementation of those metrics. >>>>>> But the connectors we provide in Apache Flink will follow the guidance of >>>>>> FLIP-33 to emit those metrics when applicable. Maybe we can have some >>>>>> public static Strings defined for the metric names to help other >>>>>> connector >>>>>> developers follow the same guidance. I can also add that to the public >>>>>> interface section of the FLIP if we decide to do that. >>>>>> >>>>>> Thanks, >>>>>> >>>>>> Jiangjie (Becket) Qin >>>>>> >>>>>> On Tue, Sep 8, 2020 at 9:02 PM Becket Qin <becket....@gmail.com> >>>>>> wrote: >>>>>> >>>>>>> >>>>>>> >>>>>>> On Tue, Sep 8, 2020 at 6:55 PM Konstantin Knauf <kna...@apache.org> >>>>>>> wrote: >>>>>>> >>>>>>>> Hi Becket, >>>>>>>> >>>>>>>> Thank you for picking up this FLIP. I have a few questions: >>>>>>>> >>>>>>>> * two thoughts on naming: >>>>>>>> * idleTime: In the meantime, a similar metric >>>>>>>> "idleTimeMsPerSecond" has >>>>>>>> been introduced in >>>>>>>> https://issues.apache.org/jira/browse/FLINK-16864.
They >>>>>>>> have a similar name, but different definitions of idleness, >>>>>>>> e.g. "idleTimeMsPerSecond" considers the SourceTask idle, when it is >>>>>>>> backpressured. Can we make it clearer that these two metrics mean >>>>>>>> different >>>>>>>> things? >>>>>>>> >>>>>>>> * "current(Fetch)Latency" I am wondering if >>>>>>>> "eventTimeLag(Before|After)" >>>>>>>> is more descriptive/clear. What do others think? >>>>>>>> >>>>>>>> * Current(Fetch)Latency implies that the timestamps are directly >>>>>>>> extracted in the source connector, right? Will this be the default >>>>>>>> for >>>>>>>> FLIP-27 sources anyway? >>>>>>>> >>>>>>>> * Does FLIP-33 also include the implementation of these metrics (to >>>>>>>> the >>>>>>>> extent possible) for all connectors currently available in Apache >>>>>>>> Flink or >>>>>>>> is the "per-connector implementation" out-of-scope? >>>>>>>> >>>>>>>> Thanks, >>>>>>>> >>>>>>>> Konstantin >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> On Fri, Sep 4, 2020 at 4:56 PM Becket Qin <becket....@gmail.com> >>>>>>>> wrote: >>>>>>>> >>>>>>>> > Hi all, >>>>>>>> > >>>>>>>> > To complete the Source refactoring work, I'd like to revive this >>>>>>>> > discussion. Since the mail thread has been dormant for more than >>>>>>>> a year, >>>>>>>> > just to recap the motivation of the FLIP: >>>>>>>> > >>>>>>>> > 1. The FLIP proposes to standardize the connector metrics by >>>>>>>> giving >>>>>>>> > guidance on the metric specifications, including the name, type >>>>>>>> and meaning >>>>>>>> > of the metrics. >>>>>>>> > 2. It is OK for a connector to not emit some of the metrics in >>>>>>>> the metric >>>>>>>> > guidance, but if a metric of the same semantic is emitted, the >>>>>>>> > implementation should follow the guidance. >>>>>>>> > 3. It is OK for a connector to emit more metrics than what are >>>>>>>> listed in >>>>>>>> > the FLIP. This includes having an alias for a metric specified in >>>>>>>> the FLIP. >>>>>>>> > 4. 
We will implement some of the metrics out of the box in the >>>>>>>> default >>>>>>>> > implementation of FLIP-27, as long as it is applicable. >>>>>>>> > >>>>>>>> > The FLIP wiki is following: >>>>>>>> > https://cwiki.apache.org/confluence/display/FLINK/FLIP-33 >>>>>>>> > %3A+Standardize+Connector+Metrics >>>>>>>> > >>>>>>>> > Any thoughts? >>>>>>>> > >>>>>>>> > Thanks, >>>>>>>> > >>>>>>>> > Jiangjie (Becket) Qin >>>>>>>> > >>>>>>>> > >>>>>>>> > On Fri, Jun 14, 2019 at 2:29 PM Piotr Nowojski < >>>>>>>> pi...@ververica.com> >>>>>>>> > wrote: >>>>>>>> > >>>>>>>> > > > we will need to revisit the convention list and adjust them >>>>>>>> accordingly >>>>>>>> > > when FLIP-27 is ready >>>>>>>> > > >>>>>>>> > > >>>>>>>> > > Yes, this sounds good :) >>>>>>>> > > >>>>>>>> > > Piotrek >>>>>>>> > > >>>>>>>> > > > On 13 Jun 2019, at 13:35, Becket Qin <becket....@gmail.com> >>>>>>>> wrote: >>>>>>>> > > > >>>>>>>> > > > Hi Piotr, >>>>>>>> > > > >>>>>>>> > > > That's great to know. Chances are that we will need to >>>>>>>> revisit the >>>>>>>> > > > convention list and adjust them accordingly when FLIP-27 is >>>>>>>> ready, At >>>>>>>> > > that >>>>>>>> > > > point we can mark some of the metrics as available by default >>>>>>>> for >>>>>>>> > > > connectors implementing the new interface. >>>>>>>> > > > >>>>>>>> > > > Thanks, >>>>>>>> > > > >>>>>>>> > > > Jiangjie (Becket) Qin >>>>>>>> > > > >>>>>>>> > > > On Thu, Jun 13, 2019 at 3:51 PM Piotr Nowojski < >>>>>>>> pi...@ververica.com> >>>>>>>> > > wrote: >>>>>>>> > > > >>>>>>>> > > >> Thanks for driving this. I’ve just noticed one small thing. >>>>>>>> With new >>>>>>>> > > >> SourceReader interface Flink will be able to provide >>>>>>>> `idleTime` metric >>>>>>>> > > >> automatically. >>>>>>>> > > >> >>>>>>>> > > >> Piotrek >>>>>>>> > > >> >>>>>>>> > > >>> On 13 Jun 2019, at 03:30, Becket Qin <becket....@gmail.com> >>>>>>>> wrote: >>>>>>>> > > >>> >>>>>>>> > > >>> Thanks all for the feedback and discussion. 
>>>>>>>> > > >>> >>>>>>>> > > >>> Since there wasn't any concern raised, I've started the >>>>>>>> voting thread >>>>>>>> > > for >>>>>>>> > > >>> this FLIP, but please feel free to continue the discussion >>>>>>>> here if >>>>>>>> > you >>>>>>>> > > >>> think something still needs to be addressed. >>>>>>>> > > >>> >>>>>>>> > > >>> Thanks, >>>>>>>> > > >>> >>>>>>>> > > >>> Jiangjie (Becket) Qin >>>>>>>> > > >>> >>>>>>>> > > >>> >>>>>>>> > > >>> >>>>>>>> > > >>> On Mon, Jun 10, 2019 at 9:10 AM Becket Qin < >>>>>>>> becket....@gmail.com> >>>>>>>> > > wrote: >>>>>>>> > > >>> >>>>>>>> > > >>>> Hi Piotr, >>>>>>>> > > >>>> >>>>>>>> > > >>>> Thanks for the comments. Yes, you are right. Users will >>>>>>>> have to look >>>>>>>> > > at >>>>>>>> > > >>>> other metrics to decide whether the pipeline is healthy or >>>>>>>> not in >>>>>>>> > the >>>>>>>> > > >> first >>>>>>>> > > >>>> place before they can use the time-based metric to fix the >>>>>>>> > bottleneck. >>>>>>>> > > >>>> >>>>>>>> > > >>>> I agree that once we have FLIP-27 ready, some of the >>>>>>>> metrics can >>>>>>>> > just >>>>>>>> > > be >>>>>>>> > > >>>> reported by the abstract implementation. >>>>>>>> > > >>>> >>>>>>>> > > >>>> I've updated FLIP-33 wiki page to add the pendingBytes and >>>>>>>> > > >> pendingRecords >>>>>>>> > > >>>> metric. Please let me know if you have any concern over >>>>>>>> the updated >>>>>>>> > > >> metric >>>>>>>> > > >>>> convention proposal. >>>>>>>> > > >>>> >>>>>>>> > > >>>> @Chesnay Schepler <ches...@apache.org> @Stephan Ewen >>>>>>>> > > >>>> <step...@ververica.com> will you also have time to take a >>>>>>>> look at >>>>>>>> > the >>>>>>>> > > >>>> proposed metric convention? If there is no further concern >>>>>>>> I'll >>>>>>>> > start >>>>>>>> > > a >>>>>>>> > > >>>> voting thread for this FLIP in two days. 
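The pendingBytes / pendingRecords metrics just added to the wiki can be backed by simple gauges; a sketch of one way a connector might compute them (illustrative only; the split-state shape is invented, loosely modeled on a log-like source such as Kafka where each split knows its end offset):

```java
// Hypothetical sketch (names from FLIP-33, the split class is invented):
// pendingRecords / pendingBytes as gauge values summed over assigned splits.
import java.util.List;

public final class PendingWorkSketch {

    public static final class SplitState {
        public final long currentOffset;     // next record to read
        public final long endOffset;         // e.g. the log end offset of a partition
        public final long avgBytesPerRecord; // estimate used for pendingBytes
        public SplitState(long currentOffset, long endOffset, long avgBytesPerRecord) {
            this.currentOffset = currentOffset;
            this.endOffset = endOffset;
            this.avgBytesPerRecord = avgBytesPerRecord;
        }
    }

    // Gauge value for pendingRecords: records available but not yet read.
    public static long pendingRecords(List<SplitState> splits) {
        return splits.stream().mapToLong(s -> s.endOffset - s.currentOffset).sum();
    }

    // Gauge value for pendingBytes, estimated from an average record size.
    public static long pendingBytes(List<SplitState> splits) {
        return splits.stream()
                .mapToLong(s -> (s.endOffset - s.currentOffset) * s.avgBytesPerRecord)
                .sum();
    }

    public static void main(String[] args) {
        List<SplitState> splits = List.of(
                new SplitState(100, 150, 200), // 50 records pending
                new SplitState(40, 40, 200));  // caught up
        System.out.println(pendingRecords(splits)); // 50
        System.out.println(pendingBytes(splits));   // 10000
    }
}
```

As discussed earlier in the thread, a connector would typically be able to report only one of the two, depending on what its client exposes.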
>>>>>>>> > > >>>> >>>>>>>> > > >>>> Thanks, >>>>>>>> > > >>>> >>>>>>>> > > >>>> Jiangjie (Becket) Qin >>>>>>>> > > >>>> >>>>>>>> > > >>>> >>>>>>>> > > >>>> >>>>>>>> > > >>>> On Wed, Jun 5, 2019 at 6:54 PM Piotr Nowojski < >>>>>>>> pi...@ververica.com> >>>>>>>> > > >> wrote: >>>>>>>> > > >>>> >>>>>>>> > > >>>>> Hi Becket, >>>>>>>> > > >>>>> >>>>>>>> > > >>>>> Thanks for the answer :) >>>>>>>> > > >>>>> >>>>>>>> > > >>>>>> By time-based metric, I meant the portion of time spent >>>>>>>> on >>>>>>>> > producing >>>>>>>> > > >> the >>>>>>>> > > >>>>>> record to downstream. For example, a source connector >>>>>>>> can report >>>>>>>> > > that >>>>>>>> > > >>>>> it's >>>>>>>> > > >>>>>> spending 80% of time to emit record to downstream >>>>>>>> processing >>>>>>>> > > pipeline. >>>>>>>> > > >>>>> In >>>>>>>> > > >>>>>> another case, a sink connector may report that its >>>>>>>> spending 30% of >>>>>>>> > > >> time >>>>>>>> > > >>>>>> producing the records to the external system. >>>>>>>> > > >>>>>> >>>>>>>> > > >>>>>> This is in some sense equivalent to the buffer usage >>>>>>>> metric: >>>>>>>> > > >>>>> >>>>>>>> > > >>>>>> - 80% of time spent on emitting records to downstream >>>>>>>> ---> >>>>>>>> > > downstream >>>>>>>> > > >>>>>> node is bottleneck ---> output buffer is probably full. >>>>>>>> > > >>>>>> - 30% of time spent on emitting records to downstream >>>>>>>> ---> >>>>>>>> > > downstream >>>>>>>> > > >>>>>> node is not bottleneck ---> output buffer is probably >>>>>>>> not full. >>>>>>>> > > >>>>> >>>>>>>> > > >>>>> If by “time spent on emitting records to downstream” you >>>>>>>> understand >>>>>>>> > > >>>>> “waiting on back pressure”, then I see your point. And I >>>>>>>> agree that >>>>>>>> > > >> some >>>>>>>> > > >>>>> kind of ratio/time based metric gives you more >>>>>>>> information. 
However >>>>>>>> > > >> under >>>>>>>> > > >>>>> “time spent on emitting records to downstream” might be >>>>>>>> hidden the >>>>>>>> > > >>>>> following (extreme) situation: >>>>>>>> > > >>>>> >>>>>>>> > > >>>>> 1. Job is barely able to handle influx of records, there >>>>>>>> is 99% >>>>>>>> > > >>>>> CPU/resource usage in the cluster, but nobody is >>>>>>>> > > >>>>> bottlenecked/backpressured, all output buffers are empty, >>>>>>>> everybody >>>>>>>> > > is >>>>>>>> > > >>>>> waiting in 1% of it’s time for more records to process. >>>>>>>> > > >>>>> 2. 80% time can still be spent on "down stream >>>>>>>> operators”, because >>>>>>>> > > they >>>>>>>> > > >>>>> are the CPU intensive operations, but this doesn’t mean >>>>>>>> that >>>>>>>> > > >> increasing the >>>>>>>> > > >>>>> parallelism down the stream will help with anything >>>>>>>> there. To the >>>>>>>> > > >> contrary, >>>>>>>> > > >>>>> increasing parallelism of the source operator might help >>>>>>>> to >>>>>>>> > increase >>>>>>>> > > >>>>> resource utilisation up to 100%. >>>>>>>> > > >>>>> >>>>>>>> > > >>>>> However, this “time based/ratio” approach can be extended >>>>>>>> to >>>>>>>> > > in/output >>>>>>>> > > >>>>> buffer usage. Besides collecting an information that >>>>>>>> input/output >>>>>>>> > > >> buffer is >>>>>>>> > > >>>>> full/empty, we can probe profile how often are buffer >>>>>>>> empty/full. >>>>>>>> > If >>>>>>>> > > >> output >>>>>>>> > > >>>>> buffer is full 1% of times, there is almost no back >>>>>>>> pressure. If >>>>>>>> > it’s >>>>>>>> > > >> full >>>>>>>> > > >>>>> 80% of times, there is some back pressure, if it’s full >>>>>>>> 99.9% of >>>>>>>> > > times, >>>>>>>> > > >>>>> there is huge back pressure. >>>>>>>> > > >>>>> >>>>>>>> > > >>>>> Now for autoscaling you could compare the input & output >>>>>>>> buffers >>>>>>>> > fill >>>>>>>> > > >>>>> ratio: >>>>>>>> > > >>>>> >>>>>>>> > > >>>>> 1. 
Both are high, the source of bottleneck is down the >>>>>>>> stream >>>>>>>> > > >>>>> 2. Output is low, input is high, this is the bottleneck >>>>>>>> and the >>>>>>>> > > higher >>>>>>>> > > >>>>> the difference, the bigger source of bottleneck is this is >>>>>>>> > > >> operator/task >>>>>>>> > > >>>>> 3. Output is high, input is low - there was some load >>>>>>>> spike that we >>>>>>>> > > are >>>>>>>> > > >>>>> currently finishing to process >>>>>>>> > > >>>>> >>>>>>>> > > >>>>> >>>>>>>> > > >>>>> >>>>>>>> > > >>>>> But long story short, we are probably diverging from the >>>>>>>> topic of >>>>>>>> > > this >>>>>>>> > > >>>>> discussion, and we can discuss this at some later point. >>>>>>>> > > >>>>> >>>>>>>> > > >>>>> For now, for sources: >>>>>>>> > > >>>>> >>>>>>>> > > >>>>> as I wrote before, +1 for: >>>>>>>> > > >>>>> - pending.bytes, Gauge >>>>>>>> > > >>>>> - pending.messages, Gauge >>>>>>>> > > >>>>> >>>>>>>> > > >>>>> When we will be developing/discussing SourceReader from >>>>>>>> FLIP-27 we >>>>>>>> > > >> might >>>>>>>> > > >>>>> then add: >>>>>>>> > > >>>>> >>>>>>>> > > >>>>> - in-memory.buffer.usage (0 - 100%) >>>>>>>> > > >>>>> >>>>>>>> > > >>>>> Which will be estimated automatically by Flink while user >>>>>>>> will be >>>>>>>> > > able >>>>>>>> > > >> to >>>>>>>> > > >>>>> override/provide better estimation. >>>>>>>> > > >>>>> >>>>>>>> > > >>>>> Piotrek >>>>>>>> > > >>>>> >>>>>>>> > > >>>>>> On 5 Jun 2019, at 05:42, Becket Qin < >>>>>>>> becket....@gmail.com> wrote: >>>>>>>> > > >>>>>> >>>>>>>> > > >>>>>> Hi Piotr, >>>>>>>> > > >>>>>> >>>>>>>> > > >>>>>> Thanks for the explanation. Please see some >>>>>>>> clarifications below. >>>>>>>> > > >>>>>> >>>>>>>> > > >>>>>> By time-based metric, I meant the portion of time spent >>>>>>>> on >>>>>>>> > producing >>>>>>>> > > >> the >>>>>>>> > > >>>>>> record to downstream. 
For example, a source connector >>>>> can report >>>>> that >>>>> it's >>>>>> spending 80% of its time emitting records to the downstream >>>>>> processing >>>>> pipeline. >>>>> In >>>>>> another case, a sink connector may report that it's spending 30% of >>>>>> its time >>>>>> producing the records to the external system. >>>>>> >>>>>> This is in some sense equivalent to the buffer usage >>>>>> metric: >>>>>> - 80% of time spent on emitting records to downstream ---> downstream >>>>>> node is the bottleneck ---> output buffer is probably full. >>>>>> - 30% of time spent on emitting records to downstream ---> downstream >>>>>> node is not the bottleneck ---> output buffer is probably not full. >>>>>> >>>>>> However, the time-based metric has a few advantages that the buffer >>>>> usage >>>>>> metric may not have. >>>>>> >>>>>> 1. The buffer usage metric may not be applicable to all the connector >>>>>> implementations, while reporting time-based metrics is always doable. >>>>>> Some source connectors may not have any input buffer, or they may use >>>>> some >>>>>> third party library that does not expose the input buffer at all. >>>>>> Similarly, for sink connectors, the implementation may not have any >>>>> output >>>>>> buffer, or the third party library does not expose such a buffer. >>>>>> >>>>>> 2. Although both types of metrics can detect bottlenecks, time-based >>>>> metrics >>>>>> can be used to generate a more informed action to remove the >>>>>> bottleneck. >>>>>> For example, when the downstream is the bottleneck, the output buffer >>>>> usage >>>>>> metric is likely to be 100%, and the input buffer usage might be 0%. >>>>> That >>>>>> means we don't know what the suitable parallelism is to lift the >>>>>> bottleneck. The time-based metric, on the other hand, would give >>>>> useful >>>>>> information, e.g. if 80% of time was spent on emitting records, we can >>>>>> roughly increase the downstream node parallelism by 4 times. >>>>>> >>>>>> Admittedly, the time-based metrics are more expensive than buffer >>>>> usage. So >>>>>> we may have to do some sampling to reduce the cost. But in general, >>>>> using >>>>>> time-based metrics seems worth adding. >>>>>> >>>>>> That being said, I don't think the buffer usage metric and time-based >>>>> metrics >>>>>> are mutually exclusive. We can probably have both. It is just that in >>>>>> practice, features like auto-scaling might prefer time-based metrics >>>>> for >>>>>> the reason stated above. >>>>>> >>>>>> 1. Define the metrics that would allow us to manually detect >>>>> bottlenecks. >>>>>> As I wrote, we already have them in most of the places, except for >>>>>> sources/sinks. >>>>>> 2.
Use those metrics, to automatically detect >>>>>>>> bottlenecks. >>>>>>>> > > Currently >>>>>>>> > > >> we >>>>>>>> > > >>>>>> are only automatically detecting back pressure and >>>>>>>> reporting it to >>>>>>>> > > the >>>>>>>> > > >>>>> user >>>>>>>> > > >>>>>> in web UI (is it exposed as a metric at all?). Detecting >>>>>>>> the root >>>>>>>> > > >> cause >>>>>>>> > > >>>>> of >>>>>>>> > > >>>>>> the back pressure (bottleneck) is one step further. >>>>>>>> > > >>>>>>> 3. Use the knowledge about where exactly the bottleneck >>>>>>>> is >>>>>>>> > located, >>>>>>>> > > >> to >>>>>>>> > > >>>>>> try to do something with it. >>>>>>>> > > >>>>>> >>>>>>>> > > >>>>>> As explained above, I think time-based metric also >>>>>>>> addresses item >>>>>>>> > 1 >>>>>>>> > > >> and >>>>>>>> > > >>>>>> item 2. >>>>>>>> > > >>>>>> >>>>>>>> > > >>>>>> Any thoughts? >>>>>>>> > > >>>>>> >>>>>>>> > > >>>>>> Thanks, >>>>>>>> > > >>>>>> >>>>>>>> > > >>>>>> Jiangjie (Becket) Qin >>>>>>>> > > >>>>>> >>>>>>>> > > >>>>>> >>>>>>>> > > >>>>>> >>>>>>>> > > >>>>>> On Mon, Jun 3, 2019 at 4:14 PM Piotr Nowojski < >>>>>>>> > pi...@ververica.com> >>>>>>>> > > >>>>> wrote: >>>>>>>> > > >>>>>> >>>>>>>> > > >>>>>>> Hi again :) >>>>>>>> > > >>>>>>> >>>>>>>> > > >>>>>>>> - pending.bytes, Gauge >>>>>>>> > > >>>>>>>> - pending.messages, Gauge >>>>>>>> > > >>>>>>> >>>>>>>> > > >>>>>>> >>>>>>>> > > >>>>>>> +1 >>>>>>>> > > >>>>>>> >>>>>>>> > > >>>>>>> And true, instead of overloading one of the metric it >>>>>>>> is better >>>>>>>> > > when >>>>>>>> > > >>>>> user >>>>>>>> > > >>>>>>> can choose to provide only one of them. >>>>>>>> > > >>>>>>> >>>>>>>> > > >>>>>>> Re 2: >>>>>>>> > > >>>>>>> >>>>>>>> > > >>>>>>>> If I understand correctly, this metric along with the >>>>>>>> pending >>>>>>>> > > >> mesages >>>>>>>> > > >>>>> / >>>>>>>> > > >>>>>>>> bytes would answer the questions of: >>>>>>>> > > >>>>>>> >>>>>>>> > > >>>>>>>> - Does the connector consume fast enough? 
Lagging >>>>> behind + empty >>>>> buffer >>>>>> = >>>>>> cannot consume fast enough. >>>>>> - Does the connector emit fast enough? Lagging behind + full buffer = >>>>>> cannot emit fast enough, i.e. the Flink pipeline is slow. >>>>>> >>>>>> Yes, exactly. This can also be used to support decisions like >>>>> changing the >>>>>> parallelism of the sources and/or downstream operators. >>>>>> >>>>>> I'm not sure if I understand your proposal with time based >>>>> measurements. >>>>>> Maybe I'm missing something, but I do not see how measuring time alone >>>>>> could answer the problem: where is the bottleneck. Time spent on the >>>>>> next/emit might be short or long (depending on how heavy the record is >>>>> to process) >>>>>> and the source can still be bottlenecked/back pressured or >>>>> not. >>>>>> Usually the easiest and most reliable way to detect >>>>> bottlenecks is >>>>>> by checking the usage of input & output buffers, since when the input >>>>>> buffer is >>>>>> full while the output buffer is empty, that's the definition of a >>>>> bottleneck. >>>>>> Also this is usually very easy and cheap to measure (it works >>>>> effectively >>>>>> the same way as Flink's current back pressure monitoring, but more >>>>> cleanly, >>>>>> without probing threads' stack traces).
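The buffer-usage rule Piotr describes (and the three-way comparison he gives later in the thread) can be sketched as a simple classification (illustrative only, not Flink code; the 0.8 "mostly full" threshold is an assumption, not something from the thread):

```java
// Hypothetical sketch: locate a bottleneck by comparing input and output
// buffer fill ratios (0.0 - 1.0), following the rules described in the thread.
public final class BottleneckSketch {

    public enum Diagnosis {
        DOWNSTREAM_BOTTLENECK,    // both buffers high: pressure comes from below
        THIS_OPERATOR_BOTTLENECK, // input full, output empty: this task is slow
        DRAINING_LOAD_SPIKE,      // output high, input low: finishing a spike
        NO_BOTTLENECK
    }

    public static Diagnosis diagnose(double inputBufferUsage, double outputBufferUsage) {
        final double high = 0.8; // assumed threshold for "mostly full"
        boolean inHigh = inputBufferUsage >= high;
        boolean outHigh = outputBufferUsage >= high;
        if (inHigh && outHigh) return Diagnosis.DOWNSTREAM_BOTTLENECK;
        if (inHigh)            return Diagnosis.THIS_OPERATOR_BOTTLENECK;
        if (outHigh)           return Diagnosis.DRAINING_LOAD_SPIKE;
        return Diagnosis.NO_BOTTLENECK;
    }

    public static void main(String[] args) {
        // Input buffer full while output buffer empty: the classic definition
        // of a bottleneck at this operator, per the discussion above.
        System.out.println(diagnose(0.99, 0.02)); // THIS_OPERATOR_BOTTLENECK
        System.out.println(diagnose(0.95, 0.97)); // DOWNSTREAM_BOTTLENECK
    }
}
```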
>>>>>>>> > > >>>>>>> >>>>>>>> > > >>>>>>> Also keep in mind that we are already using the buffer >>>>>>>> usage >>>>>>>> > > metrics >>>>>>>> > > >>>>> for >>>>>>>> > > >>>>>>> detecting the bottlenecks in Flink’s internal network >>>>>>>> exchanges >>>>>>>> > > >> (manual >>>>>>>> > > >>>>>>> work). That’s the reason why I wanted to extend this to >>>>>>>> > > >> sources/sinks, >>>>>>>> > > >>>>>>> since they are currently our blind spot. >>>>>>>> > > >>>>>>> >>>>>>>> > > >>>>>>>> One feature we are currently working on to scale Flink >>>>>>>> > > automatically >>>>>>>> > > >>>>>>> relies >>>>>>>> > > >>>>>>>> on some metrics answering the same question >>>>>>>> > > >>>>>>> >>>>>>>> > > >>>>>>> That would be very helpful feature. I think in order to >>>>>>>> achieve >>>>>>>> > > that >>>>>>>> > > >> we >>>>>>>> > > >>>>>>> would need to: >>>>>>>> > > >>>>>>> 1. Define the metrics that would allow us to manually >>>>>>>> detect >>>>>>>> > > >>>>> bottlenecks. >>>>>>>> > > >>>>>>> As I wrote, we already have them in most of the places, >>>>>>>> except of >>>>>>>> > > >>>>>>> sources/sinks. >>>>>>>> > > >>>>>>> 2. Use those metrics, to automatically detect >>>>>>>> bottlenecks. >>>>>>>> > > Currently >>>>>>>> > > >> we >>>>>>>> > > >>>>>>> are only automatically detecting back pressure and >>>>>>>> reporting it >>>>>>>> > to >>>>>>>> > > >> the >>>>>>>> > > >>>>> user >>>>>>>> > > >>>>>>> in web UI (is it exposed as a metric at all?). >>>>>>>> Detecting the root >>>>>>>> > > >>>>> cause of >>>>>>>> > > >>>>>>> the back pressure (bottleneck) is one step further. >>>>>>>> > > >>>>>>> 3. Use the knowledge about where exactly the bottleneck >>>>>>>> is >>>>>>>> > located, >>>>>>>> > > >> to >>>>>>>> > > >>>>> try >>>>>>>> > > >>>>>>> to do something with it. >>>>>>>> > > >>>>>>> >>>>>>>> > > >>>>>>> I think you are aiming for point 3., but before we >>>>>>>> reach it, we >>>>>>>> > are >>>>>>>> > > >>>>> still >>>>>>>> > > >>>>>>> missing 1. & 2. 
Also, even if we have 3, there is value in 1 & 2 for manual analysis/dashboards.

However, knowing where the bottleneck is doesn't necessarily mean that we know what we can do about it. For example, increasing parallelism might or might not help (data skew, a bottleneck on some resource that does not scale), but this remark always applies, regardless of how we detected the bottleneck.

Piotrek

On 3 Jun 2019, at 06:16, Becket Qin <becket....@gmail.com> wrote:

Hi Piotr,

Thanks for the suggestion. Some thoughts below:

Re 1: The pending messages / bytes.
I completely agree these are very useful metrics and we should expect the connector to report them. WRT the way to expose them, it seems more consistent to add two metrics instead of adding a method (unless there are use cases other than metric reporting). So we can add the following two metrics:
- pending.bytes, Gauge
- pending.messages, Gauge
Applicable connectors can choose to report them.
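Registering the two proposed gauges could look roughly like the sketch below. The tiny `Metrics` class is a stand-in for Flink's `MetricGroup#gauge`, since the real API would pull in the whole runtime; the pending values here are hard-coded for illustration:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

// Minimal stand-in for a metric group; real connectors would register
// gauges on Flink's MetricGroup instead. Names follow the proposal above.
class Metrics {
    final Map<String, Supplier<Long>> gauges = new HashMap<>();
    void gauge(String name, Supplier<Long> g) { gauges.put(name, g); }
    long read(String name) { return gauges.get(name).get(); }
}

class PendingWorkSource {
    private long pendingBytes = 2048;    // illustrative backlog
    private long pendingMessages = 16;

    // Register the two optional gauges. A connector that cannot compute
    // these values simply does not register them.
    void registerMetrics(Metrics metrics) {
        metrics.gauge("pending.bytes", () -> pendingBytes);
        metrics.gauge("pending.messages", () -> pendingMessages);
    }

    void consume(long bytes, long messages) {
        pendingBytes -= bytes;
        pendingMessages -= messages;
    }
}
```

Gauges keep the hot path free of metric updates: the value is only computed when a reporter reads it.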
These two metrics along with latency should be sufficient for users to understand the progress of a connector.

Re 2: Amount of data buffered in-memory by the connector.
If I understand correctly, this metric along with the pending messages / bytes would answer the questions of:
- Does the connector consume fast enough? Lagging behind + empty buffer = cannot consume fast enough.
- Does the connector emit fast enough? Lagging behind + full buffer = cannot emit fast enough, i.e. the Flink pipeline is slow.

One feature we are currently working on to scale Flink automatically relies on some metrics answering the same question. More specifically, we are profiling the time spent in the .next() method (time to consume) and the time spent in the .collect() method (time to emit / process). One advantage of such method-level time cost is that it allows us to calculate the parallelism required to keep up in case there is a lag.

However, one concern I have regarding such metrics is that they are implementation-specific.
Either profiling the method time or reporting buffer usage assumes the connectors are implemented in a particular way. A slightly better solution might be to have the following metric:

- EmitTimeRatio (or FetchTimeRatio): the time spent emitting records / total time elapsed.

This assumes that the source connectors have to emit the records downstream at some point. The emission may take some time (e.g. going through chained operators), and the rest of the time is spent preparing the record to emit, including time for consuming, format conversion, etc. Ideally, we'd like to see the time spent on record fetch and emit be about the same, so neither is a bottleneck for the other.

The downside of these time-based metrics is the additional overhead of getting the time, so sampling might be needed. But in practice I feel such time-based metrics might be more useful if we want to take action.

I think we should absolutely add the metrics in (1) to the metric convention. We could also add the metrics mentioned in (2) if we reach consensus on that.
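The EmitTimeRatio idea above can be sketched as a small accumulator. Time is injected as a parameter so the logic stays testable; a real implementation would read `System.nanoTime()` around the emit call, possibly only for a sampled subset of records to keep per-record overhead low. The class name is illustrative:

```java
// Sketch of the proposed EmitTimeRatio: fraction of elapsed wall-clock
// time spent emitting records. All timestamps are in nanoseconds and are
// passed in explicitly so no system calls are made here.
class EmitTimeRatio {
    private long emitNanos;        // accumulated time spent in emit/collect
    private final long startNanos; // when measurement began

    EmitTimeRatio(long startNanos) { this.startNanos = startNanos; }

    // Record one (possibly sampled) emit call's duration.
    void recordEmit(long emitStart, long emitEnd) {
        emitNanos += emitEnd - emitStart;
    }

    // Time spent emitting / total time elapsed, in [0.0, 1.0].
    double ratio(long nowNanos) {
        long elapsed = nowNanos - startNanos;
        return elapsed <= 0 ? 0.0 : (double) emitNanos / elapsed;
    }
}
```

A ratio near 1.0 suggests the downstream pipeline dominates; near 0.0 suggests fetching/preparation dominates.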
What do you think?

Thanks,

Jiangjie (Becket) Qin

On Fri, May 31, 2019 at 4:26 PM Piotr Nowojski <pi...@ververica.com> wrote:

Hey Becket,

Re 1a) and 1b) +1 from my side.

I've discussed this issue:

> 2. It would be nice to have metrics that allow us to check the cause of back pressure:
> a) for sources, length of input queue (in bytes? Or boolean hasSomething/isEmpty)
> b) for sinks, length of output queue (in bytes? Or boolean hasSomething/isEmpty)

with Nico at some length, and he also saw their benefits. We also have a more concrete proposal for that.

Actually, there are two really useful metrics that we are currently missing:

1. Number of data/records/bytes in the backlog to process. For example, the remaining number of bytes in unread files, or pending data in Kafka topics.
2. Amount of data buffered in-memory by the connector, waiting to be pushed into the Flink pipeline.
Re 1:
This would have to be a metric provided directly by a connector. It could be an undefined `int`:

`int backlog` - estimate of pending work.

"Undefined" meaning that it would be up to the connector to decide whether it's measured in bytes, records, pending files, or whatever the connector can provide. This is because I assume not every connector can provide an exact number, and for some of them it might be impossible to provide a record or byte count.

Re 2:
This metric could be either provided by a connector, or calculated crudely by Flink:

`float bufferUsage` - value from the [0.0, 1.0] range.

Percentage of used in-memory buffers, like in Kafka's handover.

It could be crudely implemented by Flink with FLIP-27's SourceReader#isAvailable. If the SourceReader is not available, the reported `bufferUsage` could be 0.0; if it is available, it could be 1.0. I think this would be a good enough estimation for most use cases (and it could be overridden and implemented better if desired).
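A minimal sketch of that crude estimate, assuming the availability signal is probed periodically (the class name is hypothetical; only the boolean availability signal corresponds to `SourceReader#isAvailable`):

```java
// Crude bufferUsage estimate: each probe reports 1.0 if the reader has
// data available and 0.0 otherwise; averaging many probes approximates
// how often the connector's in-memory buffer is non-empty.
class ProbedBufferUsage {
    private long probes;
    private long available;

    // Called periodically by a metrics thread, not on the per-record path.
    void probe(boolean readerHasData) {
        probes++;
        if (readerHasData) {
            available++;
        }
    }

    // Fraction of probes that found data available, in [0.0, 1.0].
    double usage() {
        return probes == 0 ? 0.0 : (double) available / probes;
    }
}
```

Averaged over many probes, the 0/1 samples approximate the duty cycle of the buffer.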
Especially since we are reporting only probed values: if the probed values are almost always 1.0, it would mean that we have back pressure. If they are almost always 0.0, there is probably no back pressure at the sources.

What do you think about this?

Piotrek

On 30 May 2019, at 11:41, Becket Qin <becket....@gmail.com> wrote:

Hi all,

Thanks a lot for all the feedback and comments. I'd like to continue the discussion on this FLIP.

I updated the FLIP-33 wiki to remove all the histogram metrics from the first version of the metric convention due to the performance concern. The plan is to introduce them later when we have a mechanism to opt in/out of metrics. At that point, users can decide whether they want to pay the cost to get a metric or not.

As Stephan suggested, for this FLIP let's first try to agree on the small list of conventional metrics that connectors should follow.
Just to be clear, the purpose of the convention is not to force every connector to report all these metrics, but to provide guidance in case these metrics are reported by some connectors.

@Stephan & Chesnay,

Regarding the duplication of the `RecordsIn` metric in the operator / task IOMetricGroups: from what I understand, for the source operator it is actually the SourceFunction that reports the operator-level RecordsIn/RecordsInPerSecond metric, so they are essentially the same metric in the operator-level IOMetricGroup. Similarly for the sink operator, the operator-level RecordsOut/RecordsOutPerSecond metrics are also reported by the sink function. I marked them as existing on the FLIP-33 wiki page. Please let me know if I misunderstood.

Thanks,

Jiangjie (Becket) Qin

On Thu, May 30, 2019 at 5:16 PM Becket Qin <becket....@gmail.com> wrote:

Hi Piotr,

Thanks a lot for the feedback.
1a) I guess you are referring to the part that says "original system-specific metrics should also be reported". The performance impact depends on the implementation. An efficient implementation would only record the metric once, but report it under two different metric names. This is unlikely to hurt performance.

1b) Yes, I agree that we should avoid adding overhead to the critical path by all means. This is sometimes a tradeoff: running blindly without any metrics gives the best performance, but can be frustrating when we debug issues.

2. The metrics are indeed very useful. Are they supposed to be reported by the connectors or by Flink itself? At this point FLIP-33 is more focused on providing guidance to connector authors on metric reporting. That said, after FLIP-27 I think we should absolutely report these metrics in the abstract implementation.
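The "record once, report under two names" idea from 1a) could look roughly like the sketch below. Both metric names and the `Metrics` stand-in are illustrative; a real connector would register on Flink's metric group:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

// One backing counter, exposed both under a standardized name and the
// original system-specific name, so the duplication costs nothing extra
// on the hot path. Both names below are illustrative.
class DualNameMetrics {
    private final Map<String, Supplier<Long>> gauges = new HashMap<>();
    private long recordsIn; // single backing value, updated once per record

    // Hot path: exactly one increment regardless of how many names exist.
    void onRecord() { recordsIn++; }

    void register() {
        Supplier<Long> reader = () -> recordsIn;
        gauges.put("numRecordsIn", reader);       // standardized name
        gauges.put("records-consumed", reader);   // hypothetical system-specific name
    }

    long read(String name) { return gauges.get(name).get(); }
}
```

The per-record cost is the same single increment whether one name or ten names are registered.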
In any case, the metric conventions in this list are expected to evolve over time.

Thanks,

Jiangjie (Becket) Qin

On Tue, May 28, 2019 at 6:24 PM Piotr Nowojski <pi...@ververica.com> wrote:

Hi,

Thanks for the proposal and for driving the effort here, Becket :) I've read through FLIP-33 [1], and here are a couple of my thoughts.

Big +1 for standardising the metric names between connectors, it will definitely help us and users a lot.

Issues/questions/things to discuss that I've thought of:

1a. If we are about to duplicate some metrics, can this become a performance issue?
1b. Generally speaking, we should make sure that collecting those metrics is as non-intrusive as possible, especially since they will need to be updated once per record. (They might be collected more rarely with some overhead, but the hot path of updating them per record will need to be as quick as possible.)
That includes both avoiding heavy computation on the per-record path (histograms?) and avoiding measuring time for time-based (per-second) metrics (System.currentTimeMillis(), depending on the implementation, can invoke a system call).

2. It would be nice to have metrics that allow us to check the cause of back pressure:
a) for sources, length of the input queue (in bytes? Or a boolean hasSomething/isEmpty)
b) for sinks, length of the output queue (in bytes? Or a boolean hasSomething/isEmpty)

a) is useful in a scenario where we are processing a backlog of records, all of Flink's internal input/output network buffers are empty, and we want to check whether the external source system is the bottleneck (the source's input queue will be empty), or Flink's connector is the bottleneck (the source's input queues will be full).
b) is a similar story.
A backlog of records, but this time all of Flink's internal input/output network buffers are full, and we want to check whether the external sink system is the bottleneck (the sink's output queues are full), or Flink's connector is the bottleneck (the sink's output queues will be empty).

It might sometimes be difficult to provide those metrics, so they could be optional, but if we could provide them, it would be really helpful.

Piotrek

[1] https://cwiki.apache.org/confluence/display/FLINK/FLIP-33:+Standardize+Connector+Metrics

On 24 Apr 2019, at 13:28, Stephan Ewen <se...@apache.org> wrote:

I think this sounds reasonable.
Let's keep the "reconfiguration without stopping the job" out of this, because that would be a super big effort, and if we approach it, it should be in a more generic way rather than specific to connector metrics.

I would suggest looking at the following things before starting any implementation work:

- Try and find a committer to support this, otherwise it will be hard to make progress.
- Start with defining a smaller set of "core metrics" and extend the set later. I think that is easier than blocking now on reaching consensus on a large group of metrics.
- Find a solution to the problem Chesnay mentioned, that the "records in" metric is somewhat overloaded and already exists in the IO metric group.
On Mon, Mar 25, 2019 at 7:16 AM Becket Qin <becket....@gmail.com> wrote:

Hi Stephan,

Thanks a lot for the feedback. All makes sense.

It is a good suggestion to simply have an onRecord(numBytes, eventTime) method for connector writers. It should meet most of the requirements.

The configurable metrics feature is something really useful, especially if we can somehow make it dynamically configurable without stopping the jobs. It might be better to make it a separate discussion, because it is a more generic feature instead of one only for connectors.

So in order to make some progress, in this FLIP we can limit the discussion scope to the connector-related items:

- the standard connector metric names and types
- the abstract ConnectorMetricHandler interface

I'll start a separate thread to discuss other general metric-related enhancement items, including:

- optional metrics
- dynamic metric configuration
- potential combination with a rate limiter

Does this plan sound reasonable?

Thanks,

Jiangjie (Becket) Qin

On Sat, Mar 23, 2019 at 5:53 AM Stephan Ewen <se...@apache.org> wrote:

Ignoring implementation details for a moment, this connector metrics work is a really good thing to do, in my opinion.

The question "oh, my job seems to be doing nothing, I am looking at the UI and the 'records in' value is still zero" is in the top three support questions I have been asked personally. Introspection into "how far is the consumer lagging behind" (event-time fetch latency) came up many times as well.

So big +1 to solving this problem.
About the exact design - I would try to go for the following properties:

- Keep complexity out of connectors. Ideally the metrics handler has a single onRecord(numBytes, eventTime) method or so, and everything else is internal to the handler. That makes it dead simple for the connector. We can also think of an extension scheme for connector-specific metrics.

- Make it configurable on the job or cluster level which metrics the handler internally creates when that method is invoked.

What do you think?

Best,
Stephan

On Thu, Mar 21, 2019 at 10:42 AM Chesnay Schepler <ches...@apache.org> wrote:

As I said before, I believe this to be over-engineered and have no interest in this implementation.
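For reference, the single-entry-point handler described above could be sketched as follows. This is illustrative only, not an agreed interface: the connector calls only `onRecord(numBytes, eventTime)`, and which metrics are derived from that call stays internal to the handler (and could be made configurable):

```java
// Sketch of a metrics handler with a single hot-path method. The class
// name and the derived metrics are hypothetical.
class ConnectorMetricHandler {
    private long numRecordsIn;
    private long numBytesIn;
    private long lastEventTime;

    // The only method a connector needs to call, once per record.
    void onRecord(long numBytes, long eventTime) {
        numRecordsIn++;
        numBytesIn += numBytes;
        lastEventTime = eventTime;
    }

    // Example derived metric: event-time lag relative to a wall-clock reading.
    long eventTimeLag(long wallClockMillis) {
        return wallClockMillis - lastEventTime;
    }

    long numRecordsIn() { return numRecordsIn; }
    long numBytesIn() { return numBytesIn; }
}
```

Everything beyond the single increment-and-store above can be moved off the hot path into the handler's internals.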
There are conceptual issues, like defining a duplicate numBytesIn(PerSec) metric that already exists for each operator.

On 21.03.2019 06:13, Becket Qin wrote:

A few updates to the thread. I uploaded a patch [1] as a complete example of how users can use the metrics in the future.

Some thoughts below after taking a look at `AbstractMetricGroup` and its subclasses.

This patch intends to provide convenience for Flink connector implementations to follow the metrics standards proposed in FLIP-33. It also tries to enhance metric management in a general way to help users with:

1. metric definition
2. metric dependency checks
3. metric validation
4. metric control (turning particular metrics on / off)

This patch wraps `MetricGroup` to extend the functionality of `AbstractMetricGroup` and its subclasses.
The >>>>>>>> > > >>>>>>>>>>>>>>>>> |AbstractMetricGroup| mainly focus on the >>>>>>>> metric group >>>>>>>> > > >>>>>>> hierarchy, >>>>>>>> > > >>>>>>>>>>>> but >>>>>>>> > > >>>>>>>>>>>>>>>>> does not really manage the metrics other than >>>>>>>> keeping >>>>>>>> > > them >>>>>>>> > > >>>>> in a >>>>>>>> > > >>>>>>>>> Map. >>>>>>>> > > >>>>>>>>>>>>>>>>> >>>>>>>> > > >>>>>>>>>>>>>>>>> Ideally we should only have one entry point >>>>>>>> for the >>>>>>>> > > >> metrics. >>>>>>>> > > >>>>>>>>>>>>>>>>> >>>>>>>> > > >>>>>>>>>>>>>>>>> Right now the entry point is >>>>>>>> |AbstractMetricGroup|. >>>>>>>> > > >> However, >>>>>>>> > > >>>>>>>>> besides >>>>>>>> > > >>>>>>>>>>>>>>>>> the missing functionality mentioned above, >>>>>>>> > > >>>>> |AbstractMetricGroup| >>>>>>>> > > >>>>>>>>>>>>>> seems >>>>>>>> > > >>>>>>>>>>>>>>>>> deeply rooted in Flink runtime. We could >>>>>>>> extract it out >>>>>>>> > > to >>>>>>>> > > >>>>>>>>>>>>>>>>> flink-metrics in order to use it for generic >>>>>>>> purpose. >>>>>>>> > > There >>>>>>>> > > >>>>> will >>>>>>>> > > >>>>>>>>> be >>>>>>>> > > >>>>>>>>>>>>>>>>> some work, though. >>>>>>>> > > >>>>>>>>>>>>>>>>> >>>>>>>> > > >>>>>>>>>>>>>>>>> Another approach is to make |AbstractMetrics| >>>>>>>> in this >>>>>>>> > > patch >>>>>>>> > > >>>>> as >>>>>>>> > > >>>>>>> the >>>>>>>> > > >>>>>>>>>>>>>>>>> metric entry point. It wraps metric group and >>>>>>>> provides >>>>>>>> > > the >>>>>>>> > > >>>>>>> missing >>>>>>>> > > >>>>>>>>>>>>>>>>> functionalities. Then we can roll out this >>>>>>>> pattern to >>>>>>>> > > >> runtime >>>>>>>> > > >>>>>>>>>>>>>>>>> components gradually as well. >>>>>>>> > > >>>>>>>>>>>>>>>>> >>>>>>>> > > >>>>>>>>>>>>>>>>> My first thought is that the latter approach >>>>>>>> gives a >>>>>>>> > more >>>>>>>> > > >>>>> smooth >>>>>>>> > > >>>>>>>>>>>>>>>>> migration. But I am also OK with doing a >>>>>>>> refactoring on >>>>>>>> > > the >>>>>>>> > > >>>>>>>>>>>>>>>>> |AbstractMetricGroup| family. 
Thanks,

Jiangjie (Becket) Qin

[1] https://github.com/becketqin/flink/pull/1

On Mon, Feb 25, 2019 at 2:32 PM Becket Qin <becket....@gmail.com> wrote:

Hi Chesnay,

It might be easier to discuss some implementation details in the PR review instead of in the FLIP discussion thread. I have a patch for the Kafka connectors ready but haven't submitted the PR yet. Hopefully that will help explain a bit more.

** Re: metric type binding
This is a valid point that is worth discussing. If I understand correctly, there are two points:

1. Metric type / interface does not matter as long as the metric semantic is clearly defined.
Conceptually speaking, I agree that as long as the metric semantic is defined, the metric type does not matter.
To >>>>>>>> some >>>>>>>> > extent, >>>>>>>> > > >>>>> Gauge >>>>>>>> > > >>>>>>> / >>>>>>>> > > >>>>>>>>>>>>>>>>> Counter / Meter / Histogram themselves can be >>>>>>>> think of >>>>>>>> > as >>>>>>>> > > >>>>> some >>>>>>>> > > >>>>>>>>>>>>>>>>> well-recognized semantics, if you wish. In >>>>>>>> Flink, these >>>>>>>> > > >>>>> metric >>>>>>>> > > >>>>>>>>>>>>>>>>> semantics have their associated interface >>>>>>>> classes. In >>>>>>>> > > >>>>> practice, >>>>>>>> > > >>>>>>>>>>>>>>>>> such semantic to interface binding seems >>>>>>>> necessary for >>>>>>>> > > >>>>>>> different >>>>>>>> > > >>>>>>>>>>>>>>>>> components to communicate. Simply >>>>>>>> standardize the >>>>>>>> > > semantic >>>>>>>> > > >>>>> of >>>>>>>> > > >>>>>>>>>>>>>> the >>>>>>>> > > >>>>>>>>>>>>>>>>> connector metrics seems not sufficient for >>>>>>>> people to >>>>>>>> > > build >>>>>>>> > > >>>>>>>>>>>>>>>>> ecosystem on top of. At the end of the day, >>>>>>>> we still >>>>>>>> > need >>>>>>>> > > >> to >>>>>>>> > > >>>>>>>>> have >>>>>>>> > > >>>>>>>>>>>>>>>>> some embodiment of the metric semantics that >>>>>>>> people can >>>>>>>> > > >>>>> program >>>>>>>> > > >>>>>>>>>>>>>>>>> against. >>>>>>>> > > >>>>>>>>>>>>>>>>> >>>>>>>> > > >>>>>>>>>>>>>>>>> 2. Sometimes the same metric semantic can be >>>>>>>> exposed >>>>>>>> > > using >>>>>>>> > > >>>>>>>>>>>>>>>>> different metric types / interfaces. >>>>>>>> > > >>>>>>>>>>>>>>>>> This is a good point. Counter and >>>>>>>> Gauge-as-a-Counter >>>>>>>> > are >>>>>>>> > > >>>>> pretty >>>>>>>> > > >>>>>>>>>>>>>>>>> much interchangeable. This is more of a >>>>>>>> trade-off >>>>>>>> > between >>>>>>>> > > >> the >>>>>>>> > > >>>>>>>>>>>>>> user >>>>>>>> > > >>>>>>>>>>>>>>>>> experience of metric producers and consumers. 
The metric producers want to use a Counter or a Gauge depending on whether the count is already tracked in code, while ideally the metric consumers only want to see a single metric type for each metric. I am leaning towards making the metric producers happy, i.e. allowing either the Gauge or the Counter metric type and letting the metric consumers handle the type variation. The reason is that in practice there might be more connector implementations than metric reporter implementations. We could also provide some helper method to facilitate reading from such a variable metric type.

Just some quick replies to the comments around implementation details.

> 4) single place where metrics are registered except connector-specific
> ones (which we can't really avoid).

Registering connector-specific ones in a single place is actually something that I want to achieve.
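To illustrate, the helper method mentioned above could look roughly like the following. The interfaces are simplified stand-ins for Flink's Counter and Gauge, and the MetricReaders class is a hypothetical name, not an existing API:

```java
// Simplified stand-ins for Flink's metric interfaces.
interface Metric {}
interface Counter extends Metric { long getCount(); }
interface LongGauge extends Metric { Number getValue(); }

// Hypothetical helper that lets a metric consumer read a count no matter
// whether the producer registered a Counter or a numeric Gauge.
final class MetricReaders {
    private MetricReaders() {}

    static long readCount(Metric metric) {
        if (metric instanceof Counter) {
            return ((Counter) metric).getCount();
        }
        if (metric instanceof LongGauge) {
            return ((LongGauge) metric).getValue().longValue();
        }
        throw new IllegalArgumentException(
            "Metric is neither a Counter nor a numeric Gauge: " + metric);
    }
}
```

With such a helper, reporters and downstream tooling would not need to care which of the two types a given connector chose.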
> 2) I'm talking about time-series databases like Prometheus. We would
> only have a gauge metric exposing the last fetchTime/emitTime that is
> regularly reported to the backend (Prometheus), where a user could
> build a histogram of his choosing when/if he wants it.

I am not sure such downsampling works. As an example, suppose a user complains that there are some intermittent latency spikes (maybe a few records in 10 seconds) in their processing system. A Gauge sampling the instantaneous latency is unlikely to be useful there, whereas looking at the actual 99.9th percentile latency might help.

Thanks,

Jiangjie (Becket) Qin

On Fri, Feb 22, 2019 at 9:30 PM Chesnay Schepler <ches...@apache.org> wrote:

Re: over-complication of implementation
I think I now understand better what you're shooting for, effectively something like the OperatorIOMetricGroup. But still: re-define setupConnectorMetrics() to accept a set of flags for counters/meters (and _possibly_ histograms) along with a set of well-defined Optional<Gauge<?>>, and return the group.

That solves all issues as far as I can tell:
1) no metrics must be created manually (except Gauges, which are effectively just Suppliers and you can't get around this),
2) additional metrics can be registered on the returned group,
3) see 1),
4) single place where metrics are registered except connector-specific ones (which we can't really avoid).

Re: Histogram

1) As an example, whether "numRecordsIn" is exposed as a Counter or a Gauge should be irrelevant. So far we're using the metric type that is most convenient at exposing a given value. If there is some backing data-structure that we want to expose data from, we typically opt for a Gauge, as otherwise we're just mucking around with the Meter/Counter API to get it to match. Similarly, if we want to count something but no current count exists, we typically add a Counter. That's why attaching semantics to metric types makes little sense (but unfortunately several reporters already do it); for counters/meters certainly, but the majority of metrics are gauges.

2) I'm talking about time-series databases like Prometheus. We would only have a gauge metric exposing the last fetchTime/emitTime that is regularly reported to the backend (Prometheus), where a user could build a histogram of his choosing when/if he wants it.
On 22.02.2019 13:57, Becket Qin wrote:

Hi Chesnay,

Thanks for the explanation.

** Re: FLIP

I might have misunderstood this, but it seems that "major changes" are well defined for FLIPs. The full content is the following:

What is considered a "major change" that needs a FLIP?

Any of the following should be considered a major change:

- Any major new feature, subsystem, or piece of functionality
- *Any change that impacts the public interfaces of the project*

What are the "public interfaces" of the project?

*All of the following are public interfaces* that people build around:

- DataStream and DataSet API, including classes related to that, such as StreamExecutionEnvironment
- Classes marked with the @Public annotation
- On-disk binary formats, such as checkpoints/savepoints
- User-facing scripts/command-line tools, i.e. bin/flink, Yarn scripts, Mesos scripts
- Configuration settings
- *Exposed monitoring information*

So any monitoring information change is considered a public interface change, and any public interface change is considered a "major change".

** Re: over-complication of implementation
This is more of an implementation detail that is not covered by the FLIP, but it may be worth discussing.

First of all, I completely agree that we should use the simplest way to achieve our goal. To me the goal is the following:
1. Clear connector conventions and interfaces.
2. Ease of creating a connector.

Both of them are important to the prosperity of the connector ecosystem. So I'd rather abstract as much as possible on our side to make the connector developer's work lighter. Given this goal, a static util method approach might have a few drawbacks:
1. Users still have to construct the metrics by themselves. (And note that this might be error-prone by itself. For example, a custom wrapper around a Dropwizard meter may be used instead of MeterView.)
2. When connector-specific metrics are added, it is difficult to enforce that their scope is the same as that of the standard metrics.
3. It seems that method proliferation is inevitable if we want to apply sanity checks, e.g. that the metric numBytesIn was not registered as a meter.
4. Metrics are still defined in random places and hard to track.

The current PR I had was inspired by the Config system in Kafka, which I found pretty handy. In fact it is used not only by Kafka itself but even by some other projects that depend on Kafka. I am not saying this approach is perfect, but I think it is worth it to save work for connector writers and encourage more systematic implementations. That being said, I am fully open to suggestions.

Re: Histogram

I think there are two orthogonal questions around those metrics:

1. Regardless of the metric type, by just looking at the meaning of a metric, is it generic to all connectors? If the answer is yes, we should include the metric in the convention.
No matter whether we include it in the convention or not, some connector implementations will emit such a metric. It is better to have a convention than to let each connector do random things.

2. If a standard metric is a histogram, what should we do?

I agree that we should make it clear that using histograms carries a performance risk. But I do see that histograms are useful in some fine-granularity debugging where one does not have the luxury of stopping the system and injecting more inspection code. So the workaround I am thinking of is to provide some implementation suggestions. Assuming later on we have a mechanism for selective metrics, the abstract metrics class can disable those metrics by default, and individual connector writers do not have to do anything (this is another advantage of having an AbstractMetrics instead of static util methods).
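Another way to contain the performance risk is to keep the expensive histogram update off the processing thread entirely. A rough sketch of the idea in plain Java; the sample count here is just a placeholder for a real histogram structure:

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Sketch: the processing thread only enqueues raw samples (cheap); a
// background thread folds them into the expensive aggregate.
class AsyncHistogram {
    private final BlockingQueue<Long> samples = new ArrayBlockingQueue<>(10_000);
    private final Thread folder;
    private volatile boolean running = true;
    private volatile long count; // only written by the folder thread

    AsyncHistogram() {
        folder = new Thread(() -> {
            // Keep draining until close() was called AND the queue is empty.
            while (running || !samples.isEmpty()) {
                Long value = samples.poll();
                if (value != null) {
                    count++; // placeholder for histogram.update(value)
                } else {
                    Thread.yield();
                }
            }
        });
        folder.setDaemon(true);
        folder.start();
    }

    // Hot path: never blocks; drops the sample if the queue is full.
    void update(long value) {
        samples.offer(value);
    }

    long getCount() {
        return count;
    }

    void close() throws InterruptedException {
        running = false;
        folder.join();
    }
}
```

The trade-off is that samples can be dropped under load, which may or may not be acceptable for a debugging-oriented metric.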
I am not sure I fully understand the histogram-in-the-backend approach. Can you explain a bit more? Do you mean emitting the raw data, e.g. fetchTime and emitTime, with each record and letting the histogram computation happen in the background? Or letting the processing thread put the values into a queue and having a separate thread poll from the queue and add them to the histogram?

Thanks,

Jiangjie (Becket) Qin

On Fri, Feb 22, 2019 at 4:34 PM Chesnay Schepler <ches...@apache.org> wrote:

Re: FLIP

The very first line under both the main header and the Purpose section describes FLIPs as "major changes", which this isn't.
Re: complication

I'm not arguing against standardization, but again against an over-complicated implementation when a static utility method would be sufficient:

public static void setupConnectorMetrics(
    MetricGroup operatorMetricGroup,
    String connectorName,
    Optional<Gauge<Long>> numRecordsIn,
    ...)

This gives you all you need:
* a well-defined set of metrics for a connector to opt in to
* standardized naming schemes for scope and individual metrics
* standardized metric types (although personally I'm not interested in that, since metric types should be considered syntactic sugar)

Re: Configurable Histogram

If anything they _must_ be turned off by default, but the metric system is already exposing so many options that I'm not too keen on adding even more.
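Fleshed out slightly, the static utility sketched above might look like the following. The metric interfaces are simplified stand-ins rather than Flink's actual ones, and returning the group (instead of void) follows the earlier suggestion so connector-specific metrics can still be added to it:

```java
import java.util.Optional;

// Simplified stand-ins for Flink's metric interfaces.
interface Gauge<T> { T getValue(); }
interface MetricGroup {
    MetricGroup addGroup(String name);
    <T> Gauge<T> gauge(String name, Gauge<T> gauge);
}

final class ConnectorMetricUtil {
    private ConnectorMetricUtil() {}

    // Registers the well-known metrics a connector opts in to, under a
    // standardized per-connector scope, and returns the group so
    // connector-specific metrics can still be added to it.
    static MetricGroup setupConnectorMetrics(
            MetricGroup operatorMetricGroup,
            String connectorName,
            Optional<Gauge<Long>> numRecordsIn) {
        MetricGroup connectorGroup = operatorMetricGroup.addGroup(connectorName);
        numRecordsIn.ifPresent(g -> connectorGroup.gauge("numRecordsIn", g));
        return connectorGroup;
    }
}
```

Each further standard metric would become one more Optional parameter (or a flag), which is where the method-proliferation concern raised earlier comes in.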
You have also only addressed my first argument against histograms (performance); the second one still stands (calculate the histogram in the metric backends instead).

On 21.02.2019 16:27, Becket Qin wrote:

Hi Chesnay,

Thanks for the comments. I think this is worthy of a FLIP because it is public API. According to the FLIP description, a FLIP is required in case of:

- Any change that impacts the public interfaces of the project

and the following entry is found in the definition of "public interface":

- Exposed monitoring information

Metrics are critical to any production system, so a clear metric definition is important for any serious user. For an organization with a large Flink installation, a change in metrics means a great amount of work. So such changes do need to be fully discussed and documented.

** Re: Histogram

We can discuss whether there is a better way to expose metrics that are suitable for histograms. My micro-benchmark on various histogram implementations also indicates that they are significantly slower than other metric types. But I don't think that means we should never use histograms; it means we should use them with caution. For example, we can suggest that implementations turn them off by default and only turn them on for a small amount of time when performing some micro-debugging.

** Re: complication

Connector conventions are essential for the Flink ecosystem. The Flink connector pool is probably the most important part of Flink, just as for any other data system. Clear connector conventions will help build the Flink ecosystem in a more organic way.
Take the metrics convention as an example: imagine someone has developed a Flink connector for system Foo, and another developer has developed a monitoring and diagnostic framework for Flink which analyzes Flink job performance based on metrics. With a clear metric convention, those two projects could be developed independently. Once users put them together, they would work without additional modification. This cannot be easily achieved by just defining a few constants.

** Re: selective metrics

Sure, we can discuss that in a separate thread.

@Dawid

** Re: latency / fetchedLatency

The primary purpose of establishing such a convention is to help developers write connectors in a more compatible way. The convention is supposed to be defined proactively. So when looking at the convention, it seems more important to ask whether the concept is applicable to connectors in general. It might be true that so far only the Kafka connector reports latency. But there might be hundreds of other connector implementations in the Flink ecosystem, though not in the Flink repo, and some of them also emit latency. I think a lot of other sources actually also have an append timestamp, e.g. database binlogs and some K-V stores. So I wouldn't be surprised if some database connector can also emit latency metrics.

Thanks,

Jiangjie (Becket) Qin

On Thu, Feb 21, 2019 at 10:14 PM Chesnay Schepler <ches...@apache.org> wrote:

Regarding 2): It doesn't make sense to investigate this as part of this FLIP.
This is something that could be of interest for the entire metric system, and should be designed for as such.

Regarding the proposal as a whole:

Histogram metrics shall not be added to the core of Flink. They are significantly more expensive than other metrics, and calculating histograms in the application is regarded as an anti-pattern by several metric backends, which instead recommend exposing the raw data and calculating the histogram in the backend.

Second, this seems overly complicated. Given that we already established that not all connectors will export all metrics, we are effectively reducing this down to a consistent naming scheme. We don't need anything sophisticated for that; basically just a few constants that all connectors use.

I'm not convinced that this is worthy of a FLIP.
On 21.02.2019 14:26, Dawid Wysakowicz wrote:

Hi,

Ad 1. In general I understand and I agree. But those particular metrics (latency, fetchLatency) right now would only be reported if the user uses the KafkaConsumer with an internal timestampAssigner and the StreamCharacteristic set to EventTime, right? That sounds like a very specific case. I am not sure if we should introduce a generic metric that will be disabled/absent for most implementations.

Ad 2. That sounds like an orthogonal issue that might make sense to investigate in the future.

Best,

Dawid

On 21/02/2019 13:20, Becket Qin wrote:

Hi Dawid,

Thanks for the feedback. That makes sense to me. There are two cases to be addressed.
>> 1. The metrics are supposed to be a guidance. It is likely that a connector only supports some but not all of the metrics. In that case, each connector implementation should have the freedom to decide which metrics are reported. For the metrics that are supported, the guidance should be followed.
>>
>> 2. Sometimes users may want to disable certain metrics for some reason (e.g. performance / reprocessing of data). A generic mechanism should be provided to allow users to choose which metrics are reported. This mechanism should also be honored by the connector implementations.
>>
>> Does this sound reasonable to you?
>> Thanks,
>>
>> Jiangjie (Becket) Qin
>>
>> On Thu, Feb 21, 2019 at 4:22 PM Dawid Wysakowicz <dwysakow...@apache.org> wrote:
>>> Hi,
>>>
>>> Generally I like the idea of having a unified, standard set of metrics for all connectors. I have some slight concerns about fetchLatency and latency, though. They are computed based on EventTime, which is not a purely technical feature. It often depends on some business logic, might be absent, or might be defined after the source. Those metrics could also behave in a weird way in case of replaying a backlog. Therefore I am not sure if we should include those metrics by default.
>>> Maybe we could at least introduce a feature switch for them? What do you think?
>>>
>>> Best,
>>>
>>> Dawid
>>>
>>> On 21/02/2019 03:13, Becket Qin wrote:
>>>> Bump. If there are no objections to the proposed metrics, I'll start a voting thread later today.
>>>>
>>>> Thanks,
>>>>
>>>> Jiangjie (Becket) Qin
>>>>
>>>> On Mon, Feb 11, 2019 at 8:17 PM Becket Qin <becket....@gmail.com> wrote:
>>>>> Hi folks,
>>>>>
>>>>> I would like to start the FLIP discussion thread about standardizing the connector metrics.
>>>>>
>>>>> In short, we would like to provide a convention for Flink connector metrics. It will help simplify the monitoring of and alerting on Flink jobs.
>>>>> The FLIP link is the following:
>>>>>
>>>>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-33%3A+Standardize+Connector+Metrics
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Jiangjie (Becket) Qin

--

Konstantin Knauf

https://twitter.com/snntrable

https://github.com/knaufk
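The two cases Becket describes in the quoted thread (a connector reports only the standardized metrics it actually supports, and a user-facing switch disables selected metrics) could be sketched roughly like this. The class and method names are hypothetical, not Flink API; it only illustrates the mechanism being discussed.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.function.Supplier;

// Hypothetical sketch -- not Flink API. A registry that honors a
// user-supplied set of disabled metric names, so that (1) connectors
// register only the metrics they support, and (2) disabled metrics are
// silently skipped even if the connector supports them.
class OptionalMetricRegistry {
    private final Set<String> disabled;                       // user's feature switch
    private final Map<String, Supplier<Long>> registered = new HashMap<>();

    OptionalMetricRegistry(Set<String> disabled) {
        this.disabled = disabled;
    }

    // A connector calls this only for metrics it actually supports.
    void register(String name, Supplier<Long> gauge) {
        if (!disabled.contains(name)) {
            registered.put(name, gauge);
        }
    }

    boolean isReported(String name) {
        return registered.containsKey(name);
    }
}
```

Under this sketch, a metric is absent either because the connector never registered it or because the user disabled it, and downstream monitoring treats both the same way, which matches the "optional metrics" framing in the updated FLIP-33 wiki.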