+1 from my side On Fri, Sep 18, 2020 at 4:54 PM Stephan Ewen <step...@ververica.com> wrote:
> Having the watermark lag metric was the important part from my side. > > So +1 to go ahead. > > On Fri, Sep 18, 2020 at 4:11 PM Becket Qin <becket....@gmail.com> wrote: > > > Hey Stephan, > > > > Thanks for the quick reply. I actually forgot to mention that I also > added > > a "watermarkLag" metric to the Source metrics in addition to the > > "currentFetchEventTimeLag" and "currentEmitEventTimeLag". > > So in the current proposal, both XXEventTimeLag will be based on event > > timestamps, while the watermarkLag is based on watermark. And all the > lags > > are defined against the wall clock time. Does this cover what you are > > proposing? > > > > Thanks, > > > > Jiangjie (Becket) Qin > > > > On Fri, Sep 18, 2020 at 6:08 PM Stephan Ewen <step...@ververica.com> > > wrote: > > > >> Hi Becket! > >> > >> I am wondering if it makes sense to do the following small change: > >> > >> - Have "currentFetchEventTimeLag" be defined on event timestamps > >> (optionally, if the source system exposes it) <== this is like in your > >> proposal > >> this helps understand how long the records were in the source > before > >> > >> - BUT change "currentEmitEventTimeLag" to be > >> "currentSourceWatermarkLag" instead. > >> That way, users can see how far behind the wall-clock-time the > source > >> progress is all in all. > >> It is also a more well defined metric as it does not oscillate with > >> out-of-order events. > >> > >> What do you think? > >> > >> Best, > >> Stephan > >> > >> > >> > >> On Fri, Sep 18, 2020 at 12:02 PM Becket Qin <becket....@gmail.com> > wrote: > >> > >>> Hi folks, > >>> > >>> Thanks for all the great feedback. I have just updated FLIP-33 wiki > with > >>> the following changes: > >>> > >>> 1. Renaming. "currentFetchLatency" to "currentFetchEventTimeLag", > >>> "currentLatency" to "currentEmitEventTimeLag". > >>> 2. Added the public interface code change required for the new metrics. > >>> 3. Added description of whether a metric is predefined or optional, and > >>> which component is expected to update the metric. > >>> > >>> Please let me know if you have any questions. I'll start a vote in two > >>> days if there are no further concerns. > >>> > >>> Thanks, > >>> > >>> Jiangjie (Becket) Qin > >>> > >>> On Wed, Sep 9, 2020 at 9:56 AM Becket Qin <becket....@gmail.com> > wrote: > >>> > >>>> Hi Stephan, > >>>> > >>>> Thanks for the input. Just a few more clarifications / questions. > >>>> > >>>> *Num Bytes / Records Metrics* > >>>> > >>>> 1. At this point, the *numRecordsIn(Rate)* metrics exist in both > >>>> OperatorIOMetricGroup and TaskIOMetricGroup. I did not find > >>>> *numRecordsIn(Rate)* in the TaskIOMetricGroup updated anywhere other > >>>> than in the unit tests. Am I missing something? > >>>> > >>>> 2. *numBytesIn(Rate)* metrics only exist in TaskIOMetricGroup. At this > >>>> point, the SourceReaders only has access to a SourceReaderContext > which > >>>> provides an OperatorMetricGroup. So it seems that the connector > developers > >>>> are not able to update the *numBytesIn(Rate). *With the multiple > >>>> Source chaining support, it is possible that there are multiple > Sources are > >>>> in the same task. So it looks that we need to add *numBytesIn(Rate)* > >>>> to the operator metrics as well. > >>>> > >>>> > >>>> *Current (Fetch) Latency* > >>>> > >>>> *currentFetchLatency* helps clearly tell whether the latency is caused > >>>> by Flink or not. Backpressure is not the only reason that we see fetch > >>>> latency. Even if there is no back pressure, the records may have > passed a > >>>> long pipeline before they entered Flink. For example, say the > *currentLatency > >>>> *is 10 seconds and there is no backpressure. Does that mean the record > >>>> spent 10 seconds in the Source operator? If not, how much did Flink > >>>> contribute to that 10 seconds of latency? These questions are > frequently > >>>> asked and hard to tell without the fetch latency. > >>>> > >>>> For "currentFetchLatency", we would need to understand timestamps > >>>>> before the records are decoded. That is only possible for some > sources, > >>>>> where the client gives us the records in a (partially) decoded from > already > >>>>> (like Kafka). Then, some work has been done between the fetch time > and the > >>>>> time we update the metric already, so it is already a bit closer to > the > >>>>> "currentFetchLatency". I think following this train of thought, > there is > >>>>> diminished benefit from that specific metric. > >>>> > >>>> > >>>> We may not have to report the fetch latency before records are > decoded. > >>>> One solution is to remember the* FetchTime* when the encoded records > >>>> are fetched, and report the fetch latency after the records are > decoded by > >>>> computing (*FetchTime - EventTime*). An approximate implementation > >>>> would be adding a *FetchTime *field to the *RecordsWithSplitIds* > assuming > >>>> that all the records in that data structure are fetched at the same > time. > >>>> > >>>> Thoughts? > >>>> > >>>> Thanks, > >>>> > >>>> Jiangjie (Becket) Qin > >>>> > >>>> On Wed, Sep 9, 2020 at 12:42 AM Stephan Ewen <step...@ververica.com> > >>>> wrote: > >>>> > >>>>> Thanks for reviving this, Becket! > >>>>> > >>>>> I think Konstantin's comments are great. I'd add these points: > >>>>> > >>>>> *Num Bytes / Records Metrics* > >>>>> > >>>>> For "numBytesIn" and "numRecordsIn", we should reuse the > >>>>> OperatorIOMetric group, then it also gets reported to the overview > page in > >>>>> the Web UI. > >>>>> > >>>>> The "numBytesInPerSecond" and "numRecordsInPerSecond" are > >>>>> automatically derived metrics, no need to do anything once we > populate the > >>>>> above two metrics > >>>>> > >>>>> > >>>>> *Current (Fetch) Latency* > >>>>> > >>>>> I would really go for "eventTimeLag" rather than "fetchLatency". I > >>>>> think "eventTimeLag" is a term that has some adoption in the Flink > >>>>> community and beyond. > >>>>> > >>>>> I am not so sure that I see the benefit between "currentLatency" and > >>>>> "currentFetchLatency", (or event time lag before/after) as this only > is > >>>>> different by the time it takes to emit a batch. > >>>>> - In a non-backpressured case, these should be virtually > >>>>> identical (and both dominated by watermark lag, not the actual time > it > >>>>> takes the fetch to be emitted) > >>>>> - In a backpressured case, why do you care about when data was > >>>>> fetched, as opposed to emitted? Emitted time is relevant for > application > >>>>> semantics and checkpoints. Fetch time seems to be an implementation > detail > >>>>> (how much does the source buffer). > >>>>> > >>>>> The "currentLatency" (eventTimeLagAfter) can be computed > >>>>> out-of-the-box, independent of a source implementation, so that is > also a > >>>>> good argument to make it the main metric. > >>>>> We know timestamps and watermarks in the source. Except for cases > >>>>> where no watermarks have been defined at all (batch jobs or pure > processing > >>>>> time jobs), in which case this metric should probably be "Infinite". > >>>>> > >>>>> For "currentFetchLatency", we would need to understand timestamps > >>>>> before the records are decoded. That is only possible for some > sources, > >>>>> where the client gives us the records in a (partially) decoded from > already > >>>>> (like Kafka). Then, some work has been done between the fetch time > and the > >>>>> time we update the metric already, so it is already a bit closer to > the > >>>>> "currentFetchLatency". I think following this train of thought, > there is > >>>>> diminished benefit from that specific metric. > >>>>> > >>>>> > >>>>> *Idle Time* > >>>>> > >>>>> I agree, it would be great to rename this. Maybe to "sourceWaitTime" > >>>>> or "sourceIdleTime" so to make clear that this is not exactly the > time that > >>>>> Flink's processing pipeline is idle, but the time where the source > does not > >>>>> have new data. > >>>>> > >>>>> This is not an easy metric to collect, though (except maybe for the > >>>>> sources that are only idle while they have no split assigned, like > >>>>> continuous file source). > >>>>> > >>>>> *Source Specific Metrics* > >>>>> > >>>>> I believe source-specific would only be "sourceIdleTime", > >>>>> "numRecordsInErrors", "pendingBytes", and "pendingRecords". > >>>>> > >>>>> > >>>>> *Conclusion* > >>>>> > >>>>> We can probably add "numBytesIn" and "numRecordsIn" and > "eventTimeLag" > >>>>> right away, with little complexity. > >>>>> I'd suggest to start with these right away. > >>>>> > >>>>> Best, > >>>>> Stephan > >>>>> > >>>>> > >>>>> On Tue, Sep 8, 2020 at 3:25 PM Becket Qin <becket....@gmail.com> > >>>>> wrote: > >>>>> > >>>>>> Hey Konstantin, > >>>>>> > >>>>>> Thanks for the feedback and suggestions. Please see the reply below. > >>>>>> > >>>>>> * idleTime: In the meantime, a similar metric "idleTimeMsPerSecond" > >>>>>>> has > >>>>>>> been introduced in > https://issues.apache.org/jira/browse/FLINK-16864. > >>>>>>> They > >>>>>>> have a similar name, but different definitions of idleness, > >>>>>>> e.g. "idleTimeMsPerSecond" considers the SourceTask idle, when it > is > >>>>>>> backpressured. Can we make it clearer that these two metrics mean > >>>>>>> different > >>>>>>> things? > >>>>>> > >>>>>> > >>>>>> That is a good point. I did not notice this metric earlier. It seems > >>>>>> that both metrics are useful to the users. One tells them how busy > the > >>>>>> source is and how much more throughput the source can handle. The > other > >>>>>> tells the users how long since the source has seen the last record, > which > >>>>>> is useful for debugging. I'll update the FLIP to make it clear. > >>>>>> > >>>>>> * "current(Fetch)Latency" I am wondering if > >>>>>>> "eventTimeLag(Before|After)" > >>>>>>> is more descriptive/clear. What do others think? > >>>>>> > >>>>>> > >>>>>> I am quite open to the ideas on these names. In fact I also feel > >>>>>> "current(Fetch)Latency" are not super intuitive. So it would be > great if we > >>>>>> can have better names. > >>>>>> > >>>>>> * Current(Fetch)Latency implies that the timestamps are directly > >>>>>>> extracted in the source connector, right? Will this be the default > >>>>>>> for > >>>>>>> FLIP-27 sources anyway? > >>>>>> > >>>>>> > >>>>>> The "currentFetchLatency" will probably be reported by each source > >>>>>> implementation, because the data fetching is done by SplitReaders > and there > >>>>>> is no base implementation. The "currentLatency", on the other hand, > can be > >>>>>> reported by the SourceReader base implementation. However, since > developers > >>>>>> can actually implement their own source connector without using our > base > >>>>>> implementation, these metric guidance are still useful for the > connector > >>>>>> developers in that case. > >>>>>> > >>>>>> * Does FLIP-33 also include the implementation of these metrics (to > >>>>>>> the > >>>>>>> extent possible) for all connectors currently available in Apache > >>>>>>> Flink or > >>>>>>> is the "per-connector implementation" out-of-scope? > >>>>>> > >>>>>> > >>>>>> FLIP-33 itself does not specify any implementation of those metrics. > >>>>>> But the connectors we provide in Apache Flink will follow the > guidance of > >>>>>> FLIP-33 to emit those metrics when applicable. Maybe We can have > some > >>>>>> public static Strings defined for the metric names to help other > connector > >>>>>> developers follow the same guidance. I can also add that to the > public > >>>>>> interface section of the FLIP if we decide to do that. > >>>>>> > >>>>>> Thanks, > >>>>>> > >>>>>> Jiangjie (Becket) Qin > >>>>>> > >>>>>> On Tue, Sep 8, 2020 at 9:02 PM Becket Qin <becket....@gmail.com> > >>>>>> wrote: > >>>>>> > >>>>>>> > >>>>>>> > >>>>>>> On Tue, Sep 8, 2020 at 6:55 PM Konstantin Knauf <kna...@apache.org > > > >>>>>>> wrote: > >>>>>>> > >>>>>>>> Hi Becket, > >>>>>>>> > >>>>>>>> Thank you for picking up this FLIP. I have a few questions: > >>>>>>>> > >>>>>>>> * two thoughts on naming: > >>>>>>>> * idleTime: In the meantime, a similar metric > >>>>>>>> "idleTimeMsPerSecond" has > >>>>>>>> been introduced in > >>>>>>>> https://issues.apache.org/jira/browse/FLINK-16864. They > >>>>>>>> have a similar name, but different definitions of idleness, > >>>>>>>> e.g. "idleTimeMsPerSecond" considers the SourceTask idle, when it > is > >>>>>>>> backpressured. Can we make it clearer that these two metrics mean > >>>>>>>> different > >>>>>>>> things? > >>>>>>>> > >>>>>>>> * "current(Fetch)Latency" I am wondering if > >>>>>>>> "eventTimeLag(Before|After)" > >>>>>>>> is more descriptive/clear. What do others think? > >>>>>>>> > >>>>>>>> * Current(Fetch)Latency implies that the timestamps are directly > >>>>>>>> extracted in the source connector, right? Will this be the default > >>>>>>>> for > >>>>>>>> FLIP-27 sources anyway? > >>>>>>>> > >>>>>>>> * Does FLIP-33 also include the implementation of these metrics > (to > >>>>>>>> the > >>>>>>>> extent possible) for all connectors currently available in Apache > >>>>>>>> Flink or > >>>>>>>> is the "per-connector implementation" out-of-scope? > >>>>>>>> > >>>>>>>> Thanks, > >>>>>>>> > >>>>>>>> Konstantin > >>>>>>>> > >>>>>>>> > >>>>>>>> > >>>>>>>> > >>>>>>>> > >>>>>>>> On Fri, Sep 4, 2020 at 4:56 PM Becket Qin <becket....@gmail.com> > >>>>>>>> wrote: > >>>>>>>> > >>>>>>>> > Hi all, > >>>>>>>> > > >>>>>>>> > To complete the Source refactoring work, I'd like to revive this > >>>>>>>> > discussion. Since the mail thread has been dormant for more than > >>>>>>>> a year, > >>>>>>>> > just to recap the motivation of the FLIP: > >>>>>>>> > > >>>>>>>> > 1. The FLIP proposes to standardize the connector metrics by > >>>>>>>> giving > >>>>>>>> > guidance on the metric specifications, including the name, type > >>>>>>>> and meaning > >>>>>>>> > of the metrics. > >>>>>>>> > 2. It is OK for a connector to not emit some of the metrics in > >>>>>>>> the metric > >>>>>>>> > guidance, but if a metric of the same semantic is emitted, the > >>>>>>>> > implementation should follow the guidance. > >>>>>>>> > 3. It is OK for a connector to emit more metrics than what are > >>>>>>>> listed in > >>>>>>>> > the FLIP. This includes having an alias for a metric specified > in > >>>>>>>> the FLIP. > >>>>>>>> > 4. We will implement some of the metrics out of the box in the > >>>>>>>> default > >>>>>>>> > implementation of FLIP-27, as long as it is applicable. > >>>>>>>> > > >>>>>>>> > The FLIP wiki is following: > >>>>>>>> > https://cwiki.apache.org/confluence/display/FLINK/FLIP-33 > >>>>>>>> > %3A+Standardize+Connector+Metrics > >>>>>>>> > > >>>>>>>> > Any thoughts? > >>>>>>>> > > >>>>>>>> > Thanks, > >>>>>>>> > > >>>>>>>> > Jiangjie (Becket) Qin > >>>>>>>> > > >>>>>>>> > > >>>>>>>> > On Fri, Jun 14, 2019 at 2:29 PM Piotr Nowojski < > >>>>>>>> pi...@ververica.com> > >>>>>>>> > wrote: > >>>>>>>> > > >>>>>>>> > > > we will need to revisit the convention list and adjust them > >>>>>>>> accordingly > >>>>>>>> > > when FLIP-27 is ready > >>>>>>>> > > > >>>>>>>> > > > >>>>>>>> > > Yes, this sounds good :) > >>>>>>>> > > > >>>>>>>> > > Piotrek > >>>>>>>> > > > >>>>>>>> > > > On 13 Jun 2019, at 13:35, Becket Qin <becket....@gmail.com> > >>>>>>>> wrote: > >>>>>>>> > > > > >>>>>>>> > > > Hi Piotr, > >>>>>>>> > > > > >>>>>>>> > > > That's great to know. Chances are that we will need to > >>>>>>>> revisit the > >>>>>>>> > > > convention list and adjust them accordingly when FLIP-27 is > >>>>>>>> ready, At > >>>>>>>> > > that > >>>>>>>> > > > point we can mark some of the metrics as available by > default > >>>>>>>> for > >>>>>>>> > > > connectors implementing the new interface. > >>>>>>>> > > > > >>>>>>>> > > > Thanks, > >>>>>>>> > > > > >>>>>>>> > > > Jiangjie (Becket) Qin > >>>>>>>> > > > > >>>>>>>> > > > On Thu, Jun 13, 2019 at 3:51 PM Piotr Nowojski < > >>>>>>>> pi...@ververica.com> > >>>>>>>> > > wrote: > >>>>>>>> > > > > >>>>>>>> > > >> Thanks for driving this. I’ve just noticed one small thing. > >>>>>>>> With new > >>>>>>>> > > >> SourceReader interface Flink will be able to provide > >>>>>>>> `idleTime` metric > >>>>>>>> > > >> automatically. > >>>>>>>> > > >> > >>>>>>>> > > >> Piotrek > >>>>>>>> > > >> > >>>>>>>> > > >>> On 13 Jun 2019, at 03:30, Becket Qin < > becket....@gmail.com> > >>>>>>>> wrote: > >>>>>>>> > > >>> > >>>>>>>> > > >>> Thanks all for the feedback and discussion. > >>>>>>>> > > >>> > >>>>>>>> > > >>> Since there wasn't any concern raised, I've started the > >>>>>>>> voting thread > >>>>>>>> > > for > >>>>>>>> > > >>> this FLIP, but please feel free to continue the discussion > >>>>>>>> here if > >>>>>>>> > you > >>>>>>>> > > >>> think something still needs to be addressed. > >>>>>>>> > > >>> > >>>>>>>> > > >>> Thanks, > >>>>>>>> > > >>> > >>>>>>>> > > >>> Jiangjie (Becket) Qin > >>>>>>>> > > >>> > >>>>>>>> > > >>> > >>>>>>>> > > >>> > >>>>>>>> > > >>> On Mon, Jun 10, 2019 at 9:10 AM Becket Qin < > >>>>>>>> becket....@gmail.com> > >>>>>>>> > > wrote: > >>>>>>>> > > >>> > >>>>>>>> > > >>>> Hi Piotr, > >>>>>>>> > > >>>> > >>>>>>>> > > >>>> Thanks for the comments. Yes, you are right. Users will > >>>>>>>> have to look > >>>>>>>> > > at > >>>>>>>> > > >>>> other metrics to decide whether the pipeline is healthy > or > >>>>>>>> not in > >>>>>>>> > the > >>>>>>>> > > >> first > >>>>>>>> > > >>>> place before they can use the time-based metric to fix > the > >>>>>>>> > bottleneck. > >>>>>>>> > > >>>> > >>>>>>>> > > >>>> I agree that once we have FLIP-27 ready, some of the > >>>>>>>> metrics can > >>>>>>>> > just > >>>>>>>> > > be > >>>>>>>> > > >>>> reported by the abstract implementation. > >>>>>>>> > > >>>> > >>>>>>>> > > >>>> I've updated FLIP-33 wiki page to add the pendingBytes > and > >>>>>>>> > > >> pendingRecords > >>>>>>>> > > >>>> metric. Please let me know if you have any concern over > >>>>>>>> the updated > >>>>>>>> > > >> metric > >>>>>>>> > > >>>> convention proposal. > >>>>>>>> > > >>>> > >>>>>>>> > > >>>> @Chesnay Schepler <ches...@apache.org> @Stephan Ewen > >>>>>>>> > > >>>> <step...@ververica.com> will you also have time to take > a > >>>>>>>> look at > >>>>>>>> > the > >>>>>>>> > > >>>> proposed metric convention? If there is no further > concern > >>>>>>>> I'll > >>>>>>>> > start > >>>>>>>> > > a > >>>>>>>> > > >>>> voting thread for this FLIP in two days. > >>>>>>>> > > >>>> > >>>>>>>> > > >>>> Thanks, > >>>>>>>> > > >>>> > >>>>>>>> > > >>>> Jiangjie (Becket) Qin > >>>>>>>> > > >>>> > >>>>>>>> > > >>>> > >>>>>>>> > > >>>> > >>>>>>>> > > >>>> On Wed, Jun 5, 2019 at 6:54 PM Piotr Nowojski < > >>>>>>>> pi...@ververica.com> > >>>>>>>> > > >> wrote: > >>>>>>>> > > >>>> > >>>>>>>> > > >>>>> Hi Becket, > >>>>>>>> > > >>>>> > >>>>>>>> > > >>>>> Thanks for the answer :) > >>>>>>>> > > >>>>> > >>>>>>>> > > >>>>>> By time-based metric, I meant the portion of time spent > >>>>>>>> on > >>>>>>>> > producing > >>>>>>>> > > >> the > >>>>>>>> > > >>>>>> record to downstream. For example, a source connector > >>>>>>>> can report > >>>>>>>> > > that > >>>>>>>> > > >>>>> it's > >>>>>>>> > > >>>>>> spending 80% of time to emit record to downstream > >>>>>>>> processing > >>>>>>>> > > pipeline. > >>>>>>>> > > >>>>> In > >>>>>>>> > > >>>>>> another case, a sink connector may report that its > >>>>>>>> spending 30% of > >>>>>>>> > > >> time > >>>>>>>> > > >>>>>> producing the records to the external system. > >>>>>>>> > > >>>>>> > >>>>>>>> > > >>>>>> This is in some sense equivalent to the buffer usage > >>>>>>>> metric: > >>>>>>>> > > >>>>> > >>>>>>>> > > >>>>>> - 80% of time spent on emitting records to downstream > >>>>>>>> ---> > >>>>>>>> > > downstream > >>>>>>>> > > >>>>>> node is bottleneck ---> output buffer is probably full. > >>>>>>>> > > >>>>>> - 30% of time spent on emitting records to downstream > >>>>>>>> ---> > >>>>>>>> > > downstream > >>>>>>>> > > >>>>>> node is not bottleneck ---> output buffer is probably > >>>>>>>> not full. > >>>>>>>> > > >>>>> > >>>>>>>> > > >>>>> If by “time spent on emitting records to downstream” you > >>>>>>>> understand > >>>>>>>> > > >>>>> “waiting on back pressure”, then I see your point. And I > >>>>>>>> agree that > >>>>>>>> > > >> some > >>>>>>>> > > >>>>> kind of ratio/time based metric gives you more > >>>>>>>> information. However > >>>>>>>> > > >> under > >>>>>>>> > > >>>>> “time spent on emitting records to downstream” might be > >>>>>>>> hidden the > >>>>>>>> > > >>>>> following (extreme) situation: > >>>>>>>> > > >>>>> > >>>>>>>> > > >>>>> 1. Job is barely able to handle influx of records, there > >>>>>>>> is 99% > >>>>>>>> > > >>>>> CPU/resource usage in the cluster, but nobody is > >>>>>>>> > > >>>>> bottlenecked/backpressured, all output buffers are > empty, > >>>>>>>> everybody > >>>>>>>> > > is > >>>>>>>> > > >>>>> waiting in 1% of it’s time for more records to process. > >>>>>>>> > > >>>>> 2. 80% time can still be spent on "down stream > >>>>>>>> operators”, because > >>>>>>>> > > they > >>>>>>>> > > >>>>> are the CPU intensive operations, but this doesn’t mean > >>>>>>>> that > >>>>>>>> > > >> increasing the > >>>>>>>> > > >>>>> parallelism down the stream will help with anything > >>>>>>>> there. To the > >>>>>>>> > > >> contrary, > >>>>>>>> > > >>>>> increasing parallelism of the source operator might help > >>>>>>>> to > >>>>>>>> > increase > >>>>>>>> > > >>>>> resource utilisation up to 100%. > >>>>>>>> > > >>>>> > >>>>>>>> > > >>>>> However, this “time based/ratio” approach can be > extended > >>>>>>>> to > >>>>>>>> > > in/output > >>>>>>>> > > >>>>> buffer usage. Besides collecting an information that > >>>>>>>> input/output > >>>>>>>> > > >> buffer is > >>>>>>>> > > >>>>> full/empty, we can probe profile how often are buffer > >>>>>>>> empty/full. > >>>>>>>> > If > >>>>>>>> > > >> output > >>>>>>>> > > >>>>> buffer is full 1% of times, there is almost no back > >>>>>>>> pressure. If > >>>>>>>> > it’s > >>>>>>>> > > >> full > >>>>>>>> > > >>>>> 80% of times, there is some back pressure, if it’s full > >>>>>>>> 99.9% of > >>>>>>>> > > times, > >>>>>>>> > > >>>>> there is huge back pressure. > >>>>>>>> > > >>>>> > >>>>>>>> > > >>>>> Now for autoscaling you could compare the input & output > >>>>>>>> buffers > >>>>>>>> > fill > >>>>>>>> > > >>>>> ratio: > >>>>>>>> > > >>>>> > >>>>>>>> > > >>>>> 1. Both are high, the source of bottleneck is down the > >>>>>>>> stream > >>>>>>>> > > >>>>> 2. Output is low, input is high, this is the bottleneck > >>>>>>>> and the > >>>>>>>> > > higher > >>>>>>>> > > >>>>> the difference, the bigger source of bottleneck is this > is > >>>>>>>> > > >> operator/task > >>>>>>>> > > >>>>> 3. Output is high, input is low - there was some load > >>>>>>>> spike that we > >>>>>>>> > > are > >>>>>>>> > > >>>>> currently finishing to process > >>>>>>>> > > >>>>> > >>>>>>>> > > >>>>> > >>>>>>>> > > >>>>> > >>>>>>>> > > >>>>> But long story short, we are probably diverging from the > >>>>>>>> topic of > >>>>>>>> > > this > >>>>>>>> > > >>>>> discussion, and we can discuss this at some later point. > >>>>>>>> > > >>>>> > >>>>>>>> > > >>>>> For now, for sources: > >>>>>>>> > > >>>>> > >>>>>>>> > > >>>>> as I wrote before, +1 for: > >>>>>>>> > > >>>>> - pending.bytes, Gauge > >>>>>>>> > > >>>>> - pending.messages, Gauge > >>>>>>>> > > >>>>> > >>>>>>>> > > >>>>> When we will be developing/discussing SourceReader from > >>>>>>>> FLIP-27 we > >>>>>>>> > > >> might > >>>>>>>> > > >>>>> then add: > >>>>>>>> > > >>>>> > >>>>>>>> > > >>>>> - in-memory.buffer.usage (0 - 100%) > >>>>>>>> > > >>>>> > >>>>>>>> > > >>>>> Which will be estimated automatically by Flink while > user > >>>>>>>> will be > >>>>>>>> > > able > >>>>>>>> > > >> to > >>>>>>>> > > >>>>> override/provide better estimation. > >>>>>>>> > > >>>>> > >>>>>>>> > > >>>>> Piotrek > >>>>>>>> > > >>>>> > >>>>>>>> > > >>>>>> On 5 Jun 2019, at 05:42, Becket Qin < > >>>>>>>> becket....@gmail.com> wrote: > >>>>>>>> > > >>>>>> > >>>>>>>> > > >>>>>> Hi Piotr, > >>>>>>>> > > >>>>>> > >>>>>>>> > > >>>>>> Thanks for the explanation. Please see some > >>>>>>>> clarifications below. > >>>>>>>> > > >>>>>> > >>>>>>>> > > >>>>>> By time-based metric, I meant the portion of time spent > >>>>>>>> on > >>>>>>>> > producing > >>>>>>>> > > >> the > >>>>>>>> > > >>>>>> record to downstream. For example, a source connector > >>>>>>>> can report > >>>>>>>> > > that > >>>>>>>> > > >>>>> it's > >>>>>>>> > > >>>>>> spending 80% of time to emit record to downstream > >>>>>>>> processing > >>>>>>>> > > pipeline. > >>>>>>>> > > >>>>> In > >>>>>>>> > > >>>>>> another case, a sink connector may report that its > >>>>>>>> spending 30% of > >>>>>>>> > > >> time > >>>>>>>> > > >>>>>> producing the records to the external system. > >>>>>>>> > > >>>>>> > >>>>>>>> > > >>>>>> This is in some sense equivalent to the buffer usage > >>>>>>>> metric: > >>>>>>>> > > >>>>>> - 80% of time spent on emitting records to downstream > >>>>>>>> ---> > >>>>>>>> > > downstream > >>>>>>>> > > >>>>>> node is bottleneck ---> output buffer is probably full. > >>>>>>>> > > >>>>>> - 30% of time spent on emitting records to downstream > >>>>>>>> ---> > >>>>>>>> > > downstream > >>>>>>>> > > >>>>>> node is not bottleneck ---> output buffer is probably > >>>>>>>> not full. > >>>>>>>> > > >>>>>> > >>>>>>>> > > >>>>>> However, the time-based metric has a few advantages > that > >>>>>>>> the > >>>>>>>> > buffer > >>>>>>>> > > >>>>> usage > >>>>>>>> > > >>>>>> metric may not have. > >>>>>>>> > > >>>>>> > >>>>>>>> > > >>>>>> 1. Buffer usage metric may not be applicable to all > the > >>>>>>>> connector > >>>>>>>> > > >>>>>> implementations, while reporting time-based metric are > >>>>>>>> always > >>>>>>>> > > doable. > >>>>>>>> > > >>>>>> Some source connectors may not have any input buffer, > or > >>>>>>>> they may > >>>>>>>> > > use > >>>>>>>> > > >>>>> some > >>>>>>>> > > >>>>>> third party library that does not expose the input > >>>>>>>> buffer at all. > >>>>>>>> > > >>>>>> Similarly, for sink connectors, the implementation may > >>>>>>>> not have > >>>>>>>> > any > >>>>>>>> > > >>>>> output > >>>>>>>> > > >>>>>> buffer, or the third party library does not expose such > >>>>>>>> buffer. > >>>>>>>> > > >>>>>> > >>>>>>>> > > >>>>>> 2. Although both type of metrics can detect bottleneck, > >>>>>>>> time-based > >>>>>>>> > > >>>>> metrics > >>>>>>>> > > >>>>>> can be used to generate a more informed action to > remove > >>>>>>>> the > >>>>>>>> > > >> bottleneck. > >>>>>>>> > > >>>>>> For example, when the downstream is bottleneck, the > >>>>>>>> output buffer > >>>>>>>> > > >> usage > >>>>>>>> > > >>>>>> metric is likely to be 100%, and the input buffer usage > >>>>>>>> might be > >>>>>>>> > 0%. > >>>>>>>> > > >>>>> That > >>>>>>>> > > >>>>>> means we don't know what is the suitable parallelism to > >>>>>>>> lift the > >>>>>>>> > > >>>>>> bottleneck. The time-based metric, on the other hand, > >>>>>>>> would give > >>>>>>>> > > >> useful > >>>>>>>> > > >>>>>> information, e.g. if 80% of time was spent on emitting > >>>>>>>> records, we > >>>>>>>> > > can > >>>>>>>> > > >>>>>> roughly increase the downstream node parallelism by 4 > >>>>>>>> times. > >>>>>>>> > > >>>>>> > >>>>>>>> > > >>>>>> Admittedly, the time-based metrics are more expensive > >>>>>>>> than buffer > >>>>>>>> > > >>>>> usage. So > >>>>>>>> > > >>>>>> we may have to do some sampling to reduce the cost. But > >>>>>>>> in > >>>>>>>> > general, > >>>>>>>> > > >>>>> using > >>>>>>>> > > >>>>>> time-based metrics seems worth adding. > >>>>>>>> > > >>>>>> > >>>>>>>> > > >>>>>> That being said, I don't think buffer usage metric and > >>>>>>>> time-based > >>>>>>>> > > >>>>> metrics > >>>>>>>> > > >>>>>> are mutually exclusive. We can probably have both. It > is > >>>>>>>> just that > >>>>>>>> > > in > >>>>>>>> > > >>>>>> practice, features like auto-scaling might prefer > >>>>>>>> time-based > >>>>>>>> > metrics > >>>>>>>> > > >> for > >>>>>>>> > > >>>>>> the reason stated above. > >>>>>>>> > > >>>>>> > >>>>>>>> > > >>>>>>> 1. Define the metrics that would allow us to manually > >>>>>>>> detect > >>>>>>>> > > >>>>> bottlenecks. > >>>>>>>> > > >>>>>> As I wrote, we already have them in most of the places, > >>>>>>>> except of > >>>>>>>> > > >>>>>> sources/sinks. > >>>>>>>> > > >>>>>>> 2. Use those metrics, to automatically detect > >>>>>>>> bottlenecks. > >>>>>>>> > > Currently > >>>>>>>> > > >> we > >>>>>>>> > > >>>>>> are only automatically detecting back pressure and > >>>>>>>> reporting it to > >>>>>>>> > > the > >>>>>>>> > > >>>>> user > >>>>>>>> > > >>>>>> in web UI (is it exposed as a metric at all?). > Detecting > >>>>>>>> the root > >>>>>>>> > > >> cause > >>>>>>>> > > >>>>> of > >>>>>>>> > > >>>>>> the back pressure (bottleneck) is one step further. > >>>>>>>> > > >>>>>>> 3. Use the knowledge about where exactly the > bottleneck > >>>>>>>> is > >>>>>>>> > located, > >>>>>>>> > > >> to > >>>>>>>> > > >>>>>> try to do something with it. > >>>>>>>> > > >>>>>> > >>>>>>>> > > >>>>>> As explained above, I think time-based metric also > >>>>>>>> addresses item > >>>>>>>> > 1 > >>>>>>>> > > >> and > >>>>>>>> > > >>>>>> item 2. > >>>>>>>> > > >>>>>> > >>>>>>>> > > >>>>>> Any thoughts? > >>>>>>>> > > >>>>>> > >>>>>>>> > > >>>>>> Thanks, > >>>>>>>> > > >>>>>> > >>>>>>>> > > >>>>>> Jiangjie (Becket) Qin > >>>>>>>> > > >>>>>> > >>>>>>>> > > >>>>>> > >>>>>>>> > > >>>>>> > >>>>>>>> > > >>>>>> On Mon, Jun 3, 2019 at 4:14 PM Piotr Nowojski < > >>>>>>>> > pi...@ververica.com> > >>>>>>>> > > >>>>> wrote: > >>>>>>>> > > >>>>>> > >>>>>>>> > > >>>>>>> Hi again :) > >>>>>>>> > > >>>>>>> > >>>>>>>> > > >>>>>>>> - pending.bytes, Gauge > >>>>>>>> > > >>>>>>>> - pending.messages, Gauge > >>>>>>>> > > >>>>>>> > >>>>>>>> > > >>>>>>> > >>>>>>>> > > >>>>>>> +1 > >>>>>>>> > > >>>>>>> > >>>>>>>> > > >>>>>>> And true, instead of overloading one of the metric it > >>>>>>>> is better > >>>>>>>> > > when > >>>>>>>> > > >>>>> user > >>>>>>>> > > >>>>>>> can choose to provide only one of them. > >>>>>>>> > > >>>>>>> > >>>>>>>> > > >>>>>>> Re 2: > >>>>>>>> > > >>>>>>> > >>>>>>>> > > >>>>>>>> If I understand correctly, this metric along with the > >>>>>>>> pending > >>>>>>>> > > >> mesages > >>>>>>>> > > >>>>> / > >>>>>>>> > > >>>>>>>> bytes would answer the questions of: > >>>>>>>> > > >>>>>>> > >>>>>>>> > > >>>>>>>> - Does the connector consume fast enough? Lagging > >>>>>>>> behind + empty > >>>>>>>> > > >>>>> buffer > >>>>>>>> > > >>>>>>> = > >>>>>>>> > > >>>>>>>> cannot consume fast enough. > >>>>>>>> > > >>>>>>>> - Does the connector emit fast enough? Lagging behind > >>>>>>>> + full > >>>>>>>> > > buffer > >>>>>>>> > > >> = > >>>>>>>> > > >>>>>>>> cannot emit fast enough, i.e. the Flink pipeline is > >>>>>>>> slow. > >>>>>>>> > > >>>>>>> > >>>>>>>> > > >>>>>>> Yes, exactly. This can also be used to support > >>>>>>>> decisions like > >>>>>>>> > > >> changing > >>>>>>>> > > >>>>> the > >>>>>>>> > > >>>>>>> parallelism of the sources and/or down stream > operators. > >>>>>>>> > > >>>>>>> > >>>>>>>> > > >>>>>>> I’m not sure if I understand your proposal with time > >>>>>>>> based > >>>>>>>> > > >>>>> measurements. > >>>>>>>> > > >>>>>>> Maybe I’m missing something, but I do not see how > >>>>>>>> measuring time > >>>>>>>> > > >> alone > >>>>>>>> > > >>>>>>> could answer the problem: where is the bottleneck. > Time > >>>>>>>> spent on > >>>>>>>> > > the > >>>>>>>> > > >>>>>>> next/emit might be short or long (depending on how > >>>>>>>> heavy to > >>>>>>>> > process > >>>>>>>> > > >> the > >>>>>>>> > > >>>>>>> record is) and the source can still be > bottlenecked/back > >>>>>>>> > pressured > >>>>>>>> > > or > >>>>>>>> > > >>>>> not. > >>>>>>>> > > >>>>>>> Usually the easiest and the most reliable way how to > >>>>>>>> detect > >>>>>>>> > > >>>>> bottlenecks is > >>>>>>>> > > >>>>>>> by checking usage of input & output buffers, since > when > >>>>>>>> input > >>>>>>>> > > buffer > >>>>>>>> > > >> is > >>>>>>>> > > >>>>>>> full while output buffer is empty, that’s the > >>>>>>>> definition of a > >>>>>>>> > > >>>>> bottleneck. > >>>>>>>> > > >>>>>>> Also this is usually very easy and cheap to measure > (it > >>>>>>>> works > >>>>>>>> > > >>>>> effectively > >>>>>>>> > > >>>>>>> the same way as current’s Flink back pressure > >>>>>>>> monitoring, but > >>>>>>>> > more > >>>>>>>> > > >>>>> cleanly, > >>>>>>>> > > >>>>>>> without probing thread’s stack traces). > >>>>>>>> > > >>>>>>> > >>>>>>>> > > >>>>>>> Also keep in mind that we are already using the buffer > >>>>>>>> usage > >>>>>>>> > > metrics > >>>>>>>> > > >>>>> for > >>>>>>>> > > >>>>>>> detecting the bottlenecks in Flink’s internal network > >>>>>>>> exchanges > >>>>>>>> > > >> (manual > >>>>>>>> > > >>>>>>> work). That’s the reason why I wanted to extend this > to > >>>>>>>> > > >> sources/sinks, > >>>>>>>> > > >>>>>>> since they are currently our blind spot. > >>>>>>>> > > >>>>>>> > >>>>>>>> > > >>>>>>>> One feature we are currently working on to scale > Flink > >>>>>>>> > > automatically > >>>>>>>> > > >>>>>>> relies > >>>>>>>> > > >>>>>>>> on some metrics answering the same question > >>>>>>>> > > >>>>>>> > >>>>>>>> > > >>>>>>> That would be very helpful feature. I think in order > to > >>>>>>>> achieve > >>>>>>>> > > that > >>>>>>>> > > >> we > >>>>>>>> > > >>>>>>> would need to: > >>>>>>>> > > >>>>>>> 1. Define the metrics that would allow us to manually > >>>>>>>> detect > >>>>>>>> > > >>>>> bottlenecks. > >>>>>>>> > > >>>>>>> As I wrote, we already have them in most of the > places, > >>>>>>>> except of > >>>>>>>> > > >>>>>>> sources/sinks. > >>>>>>>> > > >>>>>>> 2. Use those metrics, to automatically detect > >>>>>>>> bottlenecks. > >>>>>>>> > > Currently > >>>>>>>> > > >> we > >>>>>>>> > > >>>>>>> are only automatically detecting back pressure and > >>>>>>>> reporting it > >>>>>>>> > to > >>>>>>>> > > >> the > >>>>>>>> > > >>>>> user > >>>>>>>> > > >>>>>>> in web UI (is it exposed as a metric at all?). > >>>>>>>> Detecting the root > >>>>>>>> > > >>>>> cause of > >>>>>>>> > > >>>>>>> the back pressure (bottleneck) is one step further. > >>>>>>>> > > >>>>>>> 3. Use the knowledge about where exactly the > bottleneck > >>>>>>>> is > >>>>>>>> > located, > >>>>>>>> > > >> to > >>>>>>>> > > >>>>> try > >>>>>>>> > > >>>>>>> to do something with it. > >>>>>>>> > > >>>>>>> > >>>>>>>> > > >>>>>>> I think you are aiming for point 3., but before we > >>>>>>>> reach it, we > >>>>>>>> > are > >>>>>>>> > > >>>>> still > >>>>>>>> > > >>>>>>> missing 1. & 2. Also even if we have 3., there is a > >>>>>>>> value in 1 & > >>>>>>>> > 2 > >>>>>>>> > > >> for > >>>>>>>> > > >>>>>>> manual analysis/dashboards. > >>>>>>>> > > >>>>>>> > >>>>>>>> > > >>>>>>> However, having the knowledge of where the bottleneck > >>>>>>>> is, doesn’t > >>>>>>>> > > >>>>>>> necessarily mean that we know what we can do about it. > >>>>>>>> For > >>>>>>>> > example > >>>>>>>> > > >>>>>>> increasing parallelism might or might not help with > >>>>>>>> anything > >>>>>>>> > (data > >>>>>>>> > > >>>>> skew, > >>>>>>>> > > >>>>>>> bottleneck on some resource that does not scale), but > >>>>>>>> this remark > >>>>>>>> > > >>>>> applies > >>>>>>>> > > >>>>>>> always, regardless of the way how did we detect the > >>>>>>>> bottleneck. > >>>>>>>> > > >>>>>>> > >>>>>>>> > > >>>>>>> Piotrek > >>>>>>>> > > >>>>>>> > >>>>>>>> > > >>>>>>>> On 3 Jun 2019, at 06:16, Becket Qin < > >>>>>>>> becket....@gmail.com> > >>>>>>>> > wrote: > >>>>>>>> > > >>>>>>>> > >>>>>>>> > > >>>>>>>> Hi Piotr, > >>>>>>>> > > >>>>>>>> > >>>>>>>> > > >>>>>>>> Thanks for the suggestion. Some thoughts below: > >>>>>>>> > > >>>>>>>> > >>>>>>>> > > >>>>>>>> Re 1: The pending messages / bytes. > >>>>>>>> > > >>>>>>>> I completely agree these are very useful metrics and > >>>>>>>> we should > >>>>>>>> > > >> expect > >>>>>>>> > > >>>>> the > >>>>>>>> > > >>>>>>>> connector to report. WRT the way to expose them, it > >>>>>>>> seems more > >>>>>>>> > > >>>>> consistent > >>>>>>>> > > >>>>>>>> to add two metrics instead of adding a method (unless > >>>>>>>> there are > >>>>>>>> > > >> other > >>>>>>>> > > >>>>> use > >>>>>>>> > > >>>>>>>> cases other than metric reporting). So we can add the > >>>>>>>> following > >>>>>>>> > > two > >>>>>>>> > > >>>>>>> metrics. > >>>>>>>> > > >>>>>>>> - pending.bytes, Gauge > >>>>>>>> > > >>>>>>>> - pending.messages, Gauge > >>>>>>>> > > >>>>>>>> Applicable connectors can choose to report them. > These > >>>>>>>> two > >>>>>>>> > metrics > >>>>>>>> > > >>>>> along > >>>>>>>> > > >>>>>>>> with latency should be sufficient for users to > >>>>>>>> understand the > >>>>>>>> > > >> progress > >>>>>>>> > > >>>>>>> of a > >>>>>>>> > > >>>>>>>> connector. > >>>>>>>> > > >>>>>>>> > >>>>>>>> > > >>>>>>>> > >>>>>>>> > > >>>>>>>> Re 2: Number of buffered data in-memory of the > >>>>>>>> connector > >>>>>>>> > > >>>>>>>> If I understand correctly, this metric along with the > >>>>>>>> pending > >>>>>>>> > > >> mesages > >>>>>>>> > > >>>>> / > >>>>>>>> > > >>>>>>>> bytes would answer the questions of: > >>>>>>>> > > >>>>>>>> - Does the connector consume fast enough? Lagging > >>>>>>>> behind + empty > >>>>>>>> > > >>>>> buffer > >>>>>>>> > > >>>>>>> = > >>>>>>>> > > >>>>>>>> cannot consume fast enough. > >>>>>>>> > > >>>>>>>> - Does the connector emit fast enough? Lagging behind > >>>>>>>> + full > >>>>>>>> > > buffer > >>>>>>>> > > >> = > >>>>>>>> > > >>>>>>>> cannot emit fast enough, i.e. the Flink pipeline is > >>>>>>>> slow. > >>>>>>>> > > >>>>>>>> > >>>>>>>> > > >>>>>>>> One feature we are currently working on to scale > Flink > >>>>>>>> > > automatically > >>>>>>>> > > >>>>>>> relies > >>>>>>>> > > >>>>>>>> on some metrics answering the same question, more > >>>>>>>> specifically, > >>>>>>>> > we > >>>>>>>> > > >> are > >>>>>>>> > > >>>>>>>> profiling the time spent on .next() method (time to > >>>>>>>> consume) and > >>>>>>>> > > the > >>>>>>>> > > >>>>> time > >>>>>>>> > > >>>>>>>> spent on .collect() method (time to emit / process). > >>>>>>>> One > >>>>>>>> > advantage > >>>>>>>> > > >> of > >>>>>>>> > > >>>>>>> such > >>>>>>>> > > >>>>>>>> method level time cost allows us to calculate the > >>>>>>>> parallelism > >>>>>>>> > > >>>>> required to > >>>>>>>> > > >>>>>>>> keep up in case their is a lag. > >>>>>>>> > > >>>>>>>> > >>>>>>>> > > >>>>>>>> However, one concern I have regarding such metric is > >>>>>>>> that they > >>>>>>>> > are > >>>>>>>> > > >>>>>>>> implementation specific. Either profiling on the > >>>>>>>> method time, or > >>>>>>>> > > >>>>>>> reporting > >>>>>>>> > > >>>>>>>> buffer usage assumes the connector are implemented in > >>>>>>>> such a > >>>>>>>> > way. > >>>>>>>> > > A > >>>>>>>> > > >>>>>>>> slightly better solution might be have the following > >>>>>>>> metric: > >>>>>>>> > > >>>>>>>> > >>>>>>>> > > >>>>>>>> - EmitTimeRatio (or FetchTimeRatio): The time spent > >>>>>>>> on emitting > >>>>>>>> > > >>>>>>>> records / Total time elapsed. > >>>>>>>> > > >>>>>>>> > >>>>>>>> > > >>>>>>>> This assumes that the source connectors have to emit > >>>>>>>> the records > >>>>>>>> > > to > >>>>>>>> > > >>>>> the > >>>>>>>> > > >>>>>>>> downstream at some point. The emission may take some > >>>>>>>> time ( e.g. > >>>>>>>> > > go > >>>>>>>> > > >>>>>>> through > >>>>>>>> > > >>>>>>>> chained operators). And the rest of the time are > spent > >>>>>>>> to > >>>>>>>> > prepare > >>>>>>>> > > >> the > >>>>>>>> > > >>>>>>>> record to emit, including time for consuming and > format > >>>>>>>> > > conversion, > >>>>>>>> > > >>>>> etc. > >>>>>>>> > > >>>>>>>> Ideally, we'd like to see the time spent on record > >>>>>>>> fetch and > >>>>>>>> > emit > >>>>>>>> > > to > >>>>>>>> > > >>>>> be > >>>>>>>> > > >>>>>>>> about the same, so no one is bottleneck for the > other. > >>>>>>>> > > >>>>>>>> > >>>>>>>> > > >>>>>>>> The downside of these time based metrics is > additional > >>>>>>>> overhead > >>>>>>>> > to > >>>>>>>> > > >> get > >>>>>>>> > > >>>>>>> the > >>>>>>>> > > >>>>>>>> time, therefore sampling might be needed. But in > >>>>>>>> practice I feel > >>>>>>>> > > >> such > >>>>>>>> > > >>>>>>> time > >>>>>>>> > > >>>>>>>> based metric might be more useful if we want to take > >>>>>>>> action. > >>>>>>>> > > >>>>>>>> > >>>>>>>> > > >>>>>>>> > >>>>>>>> > > >>>>>>>> I think we should absolutely add metrics in (1) to > the > >>>>>>>> metric > >>>>>>>> > > >>>>> convention. > >>>>>>>> > > >>>>>>>> We could also add the metrics mentioned in (2) if we > >>>>>>>> reach > >>>>>>>> > > consensus > >>>>>>>> > > >>>>> on > >>>>>>>> > > >>>>>>>> that. What do you think? > >>>>>>>> > > >>>>>>>> > >>>>>>>> > > >>>>>>>> Thanks, > >>>>>>>> > > >>>>>>>> > >>>>>>>> > > >>>>>>>> Jiangjie (Becket) Qin > >>>>>>>> > > >>>>>>>> > >>>>>>>> > > >>>>>>>> > >>>>>>>> > > >>>>>>>> On Fri, May 31, 2019 at 4:26 PM Piotr Nowojski < > >>>>>>>> > > pi...@ververica.com > >>>>>>>> > > >>> > >>>>>>>> > > >>>>>>> wrote: > >>>>>>>> > > >>>>>>>> > >>>>>>>> > > >>>>>>>>> Hey Becket, > >>>>>>>> > > >>>>>>>>> > >>>>>>>> > > >>>>>>>>> Re 1a) and 1b) +1 from my side. > >>>>>>>> > > >>>>>>>>> > >>>>>>>> > > >>>>>>>>> I’ve discussed this issue: > >>>>>>>> > > >>>>>>>>> > >>>>>>>> > > >>>>>>>>>>>> > >>>>>>>> > > >>>>>>>>>>>> 2. It would be nice to have metrics, that allow > us > >>>>>>>> to check > >>>>>>>> > > the > >>>>>>>> > > >>>>> cause > >>>>>>>> > > >>>>>>>>> of > >>>>>>>> > > >>>>>>>>>>>> back pressure: > >>>>>>>> > > >>>>>>>>>>>> a) for sources, length of input queue (in bytes? > >>>>>>>> Or boolean > >>>>>>>> > > >>>>>>>>>>>> hasSomethingl/isEmpty) > >>>>>>>> > > >>>>>>>>>>>> b) for sinks, length of output queue (in bytes? > Or > >>>>>>>> boolean > >>>>>>>> > > >>>>>>>>>>>> hasSomething/isEmpty) > >>>>>>>> > > >>>>>>>>> > >>>>>>>> > > >>>>>>>>> With Nico at some lengths and he also saw the > >>>>>>>> benefits of them. > >>>>>>>> > > We > >>>>>>>> > > >>>>> also > >>>>>>>> > > >>>>>>>>> have more concrete proposal for that. > >>>>>>>> > > >>>>>>>>> > >>>>>>>> > > >>>>>>>>> Actually there are two really useful metrics, that > we > >>>>>>>> are > >>>>>>>> > missing > >>>>>>>> > > >>>>>>>>> currently: > >>>>>>>> > > >>>>>>>>> > >>>>>>>> > > >>>>>>>>> 1. Number of data/records/bytes in the backlog to > >>>>>>>> process. For > >>>>>>>> > > >>>>> example > >>>>>>>> > > >>>>>>>>> remaining number of bytes in unread files. Or > pending > >>>>>>>> data in > >>>>>>>> > > Kafka > >>>>>>>> > > >>>>>>> topics. > >>>>>>>> > > >>>>>>>>> 2. Number of buffered data in-memory of the > >>>>>>>> connector, that are > >>>>>>>> > > >>>>> waiting > >>>>>>>> > > >>>>>>> to > >>>>>>>> > > >>>>>>>>> be processed pushed to Flink pipeline. > >>>>>>>> > > >>>>>>>>> > >>>>>>>> > > >>>>>>>>> Re 1: > >>>>>>>> > > >>>>>>>>> This would have to be a metric provided directly by > a > >>>>>>>> > connector. > >>>>>>>> > > It > >>>>>>>> > > >>>>>>> could > >>>>>>>> > > >>>>>>>>> be an undefined `int`: > >>>>>>>> > > >>>>>>>>> > >>>>>>>> > > >>>>>>>>> `int backlog` - estimate of pending work. > >>>>>>>> > > >>>>>>>>> > >>>>>>>> > > >>>>>>>>> “Undefined” meaning that it would be up to a > >>>>>>>> connector to > >>>>>>>> > decided > >>>>>>>> > > >>>>>>> whether > >>>>>>>> > > >>>>>>>>> it’s measured in bytes, records, pending files or > >>>>>>>> whatever it > >>>>>>>> > is > >>>>>>>> > > >>>>>>> possible > >>>>>>>> > > >>>>>>>>> to provide by the connector. This is because I > assume > >>>>>>>> not every > >>>>>>>> > > >>>>>>> connector > >>>>>>>> > > >>>>>>>>> can provide exact number and for some of them it > >>>>>>>> might be > >>>>>>>> > > >> impossible > >>>>>>>> > > >>>>> to > >>>>>>>> > > >>>>>>>>> provide records number of bytes count. > >>>>>>>> > > >>>>>>>>> > >>>>>>>> > > >>>>>>>>> Re 2: > >>>>>>>> > > >>>>>>>>> This metric could be either provided by a connector, > >>>>>>>> or > >>>>>>>> > > calculated > >>>>>>>> > > >>>>>>> crudely > >>>>>>>> > > >>>>>>>>> by Flink: > >>>>>>>> > > >>>>>>>>> > >>>>>>>> > > >>>>>>>>> `float bufferUsage` - value from [0.0, 1.0] range > >>>>>>>> > > >>>>>>>>> > >>>>>>>> > > >>>>>>>>> Percentage of used in memory buffers, like in > Kafka’s > >>>>>>>> handover. > >>>>>>>> > > >>>>>>>>> > >>>>>>>> > > >>>>>>>>> It could be crudely implemented by Flink with > FLIP-27 > >>>>>>>> > > >>>>>>>>> SourceReader#isAvailable. If SourceReader is not > >>>>>>>> available > >>>>>>>> > > reported > >>>>>>>> > > >>>>>>>>> `bufferUsage` could be 0.0. If it is available, it > >>>>>>>> could be > >>>>>>>> > 1.0. > >>>>>>>> > > I > >>>>>>>> > > >>>>> think > >>>>>>>> > > >>>>>>>>> this would be a good enough estimation for most of > >>>>>>>> the use > >>>>>>>> > cases > >>>>>>>> > > >>>>> (that > >>>>>>>> > > >>>>>>>>> could be overloaded and implemented better if > >>>>>>>> desired). > >>>>>>>> > > Especially > >>>>>>>> > > >>>>>>> since we > >>>>>>>> > > >>>>>>>>> are reporting only probed values: if probed values > >>>>>>>> are almost > >>>>>>>> > > >> always > >>>>>>>> > > >>>>>>> “1.0”, > >>>>>>>> > > >>>>>>>>> it would mean that we have a back pressure. If they > >>>>>>>> are almost > >>>>>>>> > > >> always > >>>>>>>> > > >>>>>>>>> “0.0”, there is probably no back pressure at the > >>>>>>>> sources. > >>>>>>>> > > >>>>>>>>> > >>>>>>>> > > >>>>>>>>> What do you think about this? > >>>>>>>> > > >>>>>>>>> > >>>>>>>> > > >>>>>>>>> Piotrek > >>>>>>>> > > >>>>>>>>> > >>>>>>>> > > >>>>>>>>>> On 30 May 2019, at 11:41, Becket Qin < > >>>>>>>> becket....@gmail.com> > >>>>>>>> > > >> wrote: > >>>>>>>> > > >>>>>>>>>> > >>>>>>>> > > >>>>>>>>>> Hi all, > >>>>>>>> > > >>>>>>>>>> > >>>>>>>> > > >>>>>>>>>> Thanks a lot for all the feedback and comments. I'd > >>>>>>>> like to > >>>>>>>> > > >> continue > >>>>>>>> > > >>>>>>> the > >>>>>>>> > > >>>>>>>>>> discussion on this FLIP. > >>>>>>>> > > >>>>>>>>>> > >>>>>>>> > > >>>>>>>>>> I updated the FLIP-33 wiki to remove all the > >>>>>>>> histogram metrics > >>>>>>>> > > >> from > >>>>>>>> > > >>>>> the > >>>>>>>> > > >>>>>>>>>> first version of metric convention due to the > >>>>>>>> performance > >>>>>>>> > > concern. > >>>>>>>> > > >>>>> The > >>>>>>>> > > >>>>>>>>> plan > >>>>>>>> > > >>>>>>>>>> is to introduce them later when we have a mechanism > >>>>>>>> to opt > >>>>>>>> > > in/out > >>>>>>>> > > >>>>>>>>> metrics. > >>>>>>>> > > >>>>>>>>>> At that point, users can decide whether they want > to > >>>>>>>> pay the > >>>>>>>> > > cost > >>>>>>>> > > >> to > >>>>>>>> > > >>>>>>> get > >>>>>>>> > > >>>>>>>>>> the metric or not. > >>>>>>>> > > >>>>>>>>>> > >>>>>>>> > > >>>>>>>>>> As Stephan suggested, for this FLIP, let's first > try > >>>>>>>> to agree > >>>>>>>> > on > >>>>>>>> > > >> the > >>>>>>>> > > >>>>>>>>> small > >>>>>>>> > > >>>>>>>>>> list of conventional metrics that connectors should > >>>>>>>> follow. > >>>>>>>> > > >>>>>>>>>> Just to be clear, the purpose of the convention is > >>>>>>>> not to > >>>>>>>> > > enforce > >>>>>>>> > > >>>>> every > >>>>>>>> > > >>>>>>>>>> connector to report all these metrics, but to > >>>>>>>> provide a > >>>>>>>> > guidance > >>>>>>>> > > >> in > >>>>>>>> > > >>>>>>> case > >>>>>>>> > > >>>>>>>>>> these metrics are reported by some connectors. > >>>>>>>> > > >>>>>>>>>> > >>>>>>>> > > >>>>>>>>>> > >>>>>>>> > > >>>>>>>>>> @ Stephan & Chesnay, > >>>>>>>> > > >>>>>>>>>> > >>>>>>>> > > >>>>>>>>>> Regarding the duplication of `RecordsIn` metric in > >>>>>>>> operator / > >>>>>>>> > > task > >>>>>>>> > > >>>>>>>>>> IOMetricGroups, from what I understand, for source > >>>>>>>> operator, > >>>>>>>> > it > >>>>>>>> > > is > >>>>>>>> > > >>>>>>>>> actually > >>>>>>>> > > >>>>>>>>>> the SourceFunction that reports the operator level > >>>>>>>> > > >>>>>>>>>> RecordsIn/RecordsInPerSecond metric. So they are > >>>>>>>> essentially > >>>>>>>> > the > >>>>>>>> > > >>>>> same > >>>>>>>> > > >>>>>>>>>> metric in the operator level IOMetricGroup. > >>>>>>>> Similarly for the > >>>>>>>> > > Sink > >>>>>>>> > > >>>>>>>>>> operator, the operator level > >>>>>>>> RecordsOut/RecordsOutPerSecond > >>>>>>>> > > >> metrics > >>>>>>>> > > >>>>> are > >>>>>>>> > > >>>>>>>>>> also reported by the Sink function. I marked them > as > >>>>>>>> existing > >>>>>>>> > in > >>>>>>>> > > >> the > >>>>>>>> > > >>>>>>>>>> FLIP-33 wiki page. Please let me know if I > >>>>>>>> misunderstood. > >>>>>>>> > > >>>>>>>>>> > >>>>>>>> > > >>>>>>>>>> Thanks, > >>>>>>>> > > >>>>>>>>>> > >>>>>>>> > > >>>>>>>>>> Jiangjie (Becket) Qin > >>>>>>>> > > >>>>>>>>>> > >>>>>>>> > > >>>>>>>>>> > >>>>>>>> > > >>>>>>>>>> On Thu, May 30, 2019 at 5:16 PM Becket Qin < > >>>>>>>> > > becket....@gmail.com> > >>>>>>>> > > >>>>>>> wrote: > >>>>>>>> > > >>>>>>>>>> > >>>>>>>> > > >>>>>>>>>>> Hi Piotr, > >>>>>>>> > > >>>>>>>>>>> > >>>>>>>> > > >>>>>>>>>>> Thanks a lot for the feedback. > >>>>>>>> > > >>>>>>>>>>> > >>>>>>>> > > >>>>>>>>>>> 1a) I guess you are referring to the part that > >>>>>>>> "original > >>>>>>>> > system > >>>>>>>> > > >>>>>>> specific > >>>>>>>> > > >>>>>>>>>>> metrics should also be reported". The performance > >>>>>>>> impact > >>>>>>>> > > depends > >>>>>>>> > > >> on > >>>>>>>> > > >>>>>>> the > >>>>>>>> > > >>>>>>>>>>> implementation. An efficient implementation would > >>>>>>>> only record > >>>>>>>> > > the > >>>>>>>> > > >>>>>>> metric > >>>>>>>> > > >>>>>>>>>>> once, but report them with two different metric > >>>>>>>> names. This > >>>>>>>> > is > >>>>>>>> > > >>>>>>> unlikely > >>>>>>>> > > >>>>>>>>> to > >>>>>>>> > > >>>>>>>>>>> hurt performance. > >>>>>>>> > > >>>>>>>>>>> > >>>>>>>> > > >>>>>>>>>>> 1b) Yes, I agree that we should avoid adding > >>>>>>>> overhead to the > >>>>>>>> > > >>>>> critical > >>>>>>>> > > >>>>>>>>> path > >>>>>>>> > > >>>>>>>>>>> by all means. This is sometimes a tradeoff, > running > >>>>>>>> blindly > >>>>>>>> > > >> without > >>>>>>>> > > >>>>>>> any > >>>>>>>> > > >>>>>>>>>>> metric gives best performance, but sometimes might > >>>>>>>> be > >>>>>>>> > > frustrating > >>>>>>>> > > >>>>> when > >>>>>>>> > > >>>>>>>>> we > >>>>>>>> > > >>>>>>>>>>> debug some issues. > >>>>>>>> > > >>>>>>>>>>> > >>>>>>>> > > >>>>>>>>>>> 2. The metrics are indeed very useful. Are they > >>>>>>>> supposed to > >>>>>>>> > be > >>>>>>>> > > >>>>>>> reported > >>>>>>>> > > >>>>>>>>> by > >>>>>>>> > > >>>>>>>>>>> the connectors or Flink itself? At this point > >>>>>>>> FLIP-33 is more > >>>>>>>> > > >>>>> focused > >>>>>>>> > > >>>>>>> on > >>>>>>>> > > >>>>>>>>>>> provide a guidance to the connector authors on the > >>>>>>>> metrics > >>>>>>>> > > >>>>> reporting. > >>>>>>>> > > >>>>>>>>> That > >>>>>>>> > > >>>>>>>>>>> said, after FLIP-27, I think we should absolutely > >>>>>>>> report > >>>>>>>> > these > >>>>>>>> > > >>>>> metrics > >>>>>>>> > > >>>>>>>>> in > >>>>>>>> > > >>>>>>>>>>> the abstract implementation. In any case, the > metric > >>>>>>>> > convention > >>>>>>>> > > >> in > >>>>>>>> > > >>>>>>> this > >>>>>>>> > > >>>>>>>>>>> list are expected to evolve over time. > >>>>>>>> > > >>>>>>>>>>> > >>>>>>>> > > >>>>>>>>>>> Thanks, > >>>>>>>> > > >>>>>>>>>>> > >>>>>>>> > > >>>>>>>>>>> Jiangjie (Becket) Qin > >>>>>>>> > > >>>>>>>>>>> > >>>>>>>> > > >>>>>>>>>>> On Tue, May 28, 2019 at 6:24 PM Piotr Nowojski < > >>>>>>>> > > >>>>> pi...@ververica.com> > >>>>>>>> > > >>>>>>>>>>> wrote: > >>>>>>>> > > >>>>>>>>>>> > >>>>>>>> > > >>>>>>>>>>>> Hi, > >>>>>>>> > > >>>>>>>>>>>> > >>>>>>>> > > >>>>>>>>>>>> Thanks for the proposal and driving the effort > >>>>>>>> here Becket > >>>>>>>> > :) > >>>>>>>> > > >> I’ve > >>>>>>>> > > >>>>>>> read > >>>>>>>> > > >>>>>>>>>>>> through the FLIP-33 [1], and here are couple of > my > >>>>>>>> thoughts. > >>>>>>>> > > >>>>>>>>>>>> > >>>>>>>> > > >>>>>>>>>>>> Big +1 for standardising the metric names between > >>>>>>>> > connectors, > >>>>>>>> > > it > >>>>>>>> > > >>>>> will > >>>>>>>> > > >>>>>>>>>>>> definitely help us and users a lot. > >>>>>>>> > > >>>>>>>>>>>> > >>>>>>>> > > >>>>>>>>>>>> Issues/questions/things to discuss that I’ve > >>>>>>>> thought of: > >>>>>>>> > > >>>>>>>>>>>> > >>>>>>>> > > >>>>>>>>>>>> 1a. If we are about to duplicate some metrics, > can > >>>>>>>> this > >>>>>>>> > > become a > >>>>>>>> > > >>>>>>>>>>>> performance issue? > >>>>>>>> > > >>>>>>>>>>>> 1b. Generally speaking, we should make sure that > >>>>>>>> collecting > >>>>>>>> > > >> those > >>>>>>>> > > >>>>>>>>> metrics > >>>>>>>> > > >>>>>>>>>>>> is as non intrusive as possible, especially that > >>>>>>>> they will > >>>>>>>> > > need > >>>>>>>> > > >>>>> to be > >>>>>>>> > > >>>>>>>>>>>> updated once per record. (They might be collected > >>>>>>>> more > >>>>>>>> > rarely > >>>>>>>> > > >> with > >>>>>>>> > > >>>>>>> some > >>>>>>>> > > >>>>>>>>>>>> overhead, but the hot path of updating it per > >>>>>>>> record will > >>>>>>>> > need > >>>>>>>> > > >> to > >>>>>>>> > > >>>>> be > >>>>>>>> > > >>>>>>> as > >>>>>>>> > > >>>>>>>>>>>> quick as possible). That includes both avoiding > >>>>>>>> heavy > >>>>>>>> > > >> computation > >>>>>>>> > > >>>>> on > >>>>>>>> > > >>>>>>>>> per > >>>>>>>> > > >>>>>>>>>>>> record path: histograms?, measuring time for time > >>>>>>>> based > >>>>>>>> > > metrics > >>>>>>>> > > >>>>> (per > >>>>>>>> > > >>>>>>>>>>>> second) (System.currentTimeMillis() depending on > >>>>>>>> the > >>>>>>>> > > >>>>> implementation > >>>>>>>> > > >>>>>>> can > >>>>>>>> > > >>>>>>>>>>>> invoke a system call) > >>>>>>>> > > >>>>>>>>>>>> > >>>>>>>> > > >>>>>>>>>>>> 2. It would be nice to have metrics, that allow > us > >>>>>>>> to check > >>>>>>>> > > the > >>>>>>>> > > >>>>> cause > >>>>>>>> > > >>>>>>>>> of > >>>>>>>> > > >>>>>>>>>>>> back pressure: > >>>>>>>> > > >>>>>>>>>>>> a) for sources, length of input queue (in bytes? > >>>>>>>> Or boolean > >>>>>>>> > > >>>>>>>>>>>> hasSomethingl/isEmpty) > >>>>>>>> > > >>>>>>>>>>>> b) for sinks, length of output queue (in bytes? > Or > >>>>>>>> boolean > >>>>>>>> > > >>>>>>>>>>>> hasSomething/isEmpty) > >>>>>>>> > > >>>>>>>>>>>> > >>>>>>>> > > >>>>>>>>>>>> a) is useful in a scenario when we are processing > >>>>>>>> backlog of > >>>>>>>> > > >>>>> records, > >>>>>>>> > > >>>>>>>>> all > >>>>>>>> > > >>>>>>>>>>>> of the internal Flink’s input/output network > >>>>>>>> buffers are > >>>>>>>> > > empty, > >>>>>>>> > > >>>>> and > >>>>>>>> > > >>>>>>> we > >>>>>>>> > > >>>>>>>>> want > >>>>>>>> > > >>>>>>>>>>>> to check whether the external source system is > the > >>>>>>>> > bottleneck > >>>>>>>> > > >>>>>>> (source’s > >>>>>>>> > > >>>>>>>>>>>> input queue will be empty), or if the Flink’s > >>>>>>>> connector is > >>>>>>>> > the > >>>>>>>> > > >>>>>>>>> bottleneck > >>>>>>>> > > >>>>>>>>>>>> (source’s input queues will be full). > >>>>>>>> > > >>>>>>>>>>>> b) similar story. Backlog of records, but this > >>>>>>>> time all of > >>>>>>>> > the > >>>>>>>> > > >>>>>>> internal > >>>>>>>> > > >>>>>>>>>>>> Flink’s input/ouput network buffers are full, and > >>>>>>>> we want o > >>>>>>>> > > >> check > >>>>>>>> > > >>>>>>>>> whether > >>>>>>>> > > >>>>>>>>>>>> the external sink system is the bottleneck (sink > >>>>>>>> output > >>>>>>>> > queues > >>>>>>>> > > >> are > >>>>>>>> > > >>>>>>>>> full), > >>>>>>>> > > >>>>>>>>>>>> or if the Flink’s connector is the bottleneck > >>>>>>>> (sink’s output > >>>>>>>> > > >>>>> queues > >>>>>>>> > > >>>>>>>>> will be > >>>>>>>> > > >>>>>>>>>>>> empty) > >>>>>>>> > > >>>>>>>>>>>> > >>>>>>>> > > >>>>>>>>>>>> It might be sometimes difficult to provide those > >>>>>>>> metrics, so > >>>>>>>> > > >> they > >>>>>>>> > > >>>>>>> could > >>>>>>>> > > >>>>>>>>>>>> be optional, but if we could provide them, it > >>>>>>>> would be > >>>>>>>> > really > >>>>>>>> > > >>>>>>> helpful. > >>>>>>>> > > >>>>>>>>>>>> > >>>>>>>> > > >>>>>>>>>>>> Piotrek > >>>>>>>> > > >>>>>>>>>>>> > >>>>>>>> > > >>>>>>>>>>>> [1] > >>>>>>>> > > >>>>>>>>>>>> > >>>>>>>> > > >>>>>>>>> > >>>>>>>> > > >>>>>>> > >>>>>>>> > > >>>>> > >>>>>>>> > > >> > >>>>>>>> > > > >>>>>>>> > > >>>>>>>> > https://cwiki.apache.org/confluence/display/FLINK/FLIP-33:+Standardize+Connector+Metrics > >>>>>>>> > > >>>>>>>>>>>> < > >>>>>>>> > > >>>>>>>>>>>> > >>>>>>>> > > >>>>>>>>> > >>>>>>>> > > >>>>>>> > >>>>>>>> > > >>>>> > >>>>>>>> > > >> > >>>>>>>> > > > >>>>>>>> > > >>>>>>>> > https://cwiki.apache.org/confluence/display/FLINK/FLIP-33:+Standardize+Connector+Metrics > >>>>>>>> > > >>>>>>>>>>>>> > >>>>>>>> > > >>>>>>>>>>>> > >>>>>>>> > > >>>>>>>>>>>>> On 24 Apr 2019, at 13:28, Stephan Ewen < > >>>>>>>> se...@apache.org> > >>>>>>>> > > >> wrote: > >>>>>>>> > > >>>>>>>>>>>>> > >>>>>>>> > > >>>>>>>>>>>>> I think this sounds reasonable. > >>>>>>>> > > >>>>>>>>>>>>> > >>>>>>>> > > >>>>>>>>>>>>> Let's keep the "reconfiguration without stopping > >>>>>>>> the job" > >>>>>>>> > out > >>>>>>>> > > >> of > >>>>>>>> > > >>>>>>> this, > >>>>>>>> > > >>>>>>>>>>>>> because that would be a super big effort and if > >>>>>>>> we approach > >>>>>>>> > > >> that, > >>>>>>>> > > >>>>>>> then > >>>>>>>> > > >>>>>>>>>>>> in > >>>>>>>> > > >>>>>>>>>>>>> more generic way rather than specific to > >>>>>>>> connector metrics. > >>>>>>>> > > >>>>>>>>>>>>> > >>>>>>>> > > >>>>>>>>>>>>> I would suggest to look at the following things > >>>>>>>> before > >>>>>>>> > > starting > >>>>>>>> > > >>>>> with > >>>>>>>> > > >>>>>>>>> any > >>>>>>>> > > >>>>>>>>>>>>> implementation work: > >>>>>>>> > > >>>>>>>>>>>>> > >>>>>>>> > > >>>>>>>>>>>>> - Try and find a committer to support this, > >>>>>>>> otherwise it > >>>>>>>> > will > >>>>>>>> > > >> be > >>>>>>>> > > >>>>>>> hard > >>>>>>>> > > >>>>>>>>>>>> to > >>>>>>>> > > >>>>>>>>>>>>> make progress > >>>>>>>> > > >>>>>>>>>>>>> - Start with defining a smaller set of "core > >>>>>>>> metrics" and > >>>>>>>> > > >> extend > >>>>>>>> > > >>>>> the > >>>>>>>> > > >>>>>>>>>>>> set > >>>>>>>> > > >>>>>>>>>>>>> later. I think that is easier than now blocking > >>>>>>>> on reaching > >>>>>>>> > > >>>>>>> consensus > >>>>>>>> > > >>>>>>>>>>>> on a > >>>>>>>> > > >>>>>>>>>>>>> large group of metrics. > >>>>>>>> > > >>>>>>>>>>>>> - Find a solution to the problem Chesnay > >>>>>>>> mentioned, that > >>>>>>>> > the > >>>>>>>> > > >>>>>>> "records > >>>>>>>> > > >>>>>>>>>>>> in" > >>>>>>>> > > >>>>>>>>>>>>> metric is somehow overloaded and exists already > >>>>>>>> in the IO > >>>>>>>> > > >> Metric > >>>>>>>> > > >>>>>>>>> group. > >>>>>>>> > > >>>>>>>>>>>>> > >>>>>>>> > > >>>>>>>>>>>>> > >>>>>>>> > > >>>>>>>>>>>>> On Mon, Mar 25, 2019 at 7:16 AM Becket Qin < > >>>>>>>> > > >> becket....@gmail.com > >>>>>>>> > > >>>>>> > >>>>>>>> > > >>>>>>>>>>>> wrote: > >>>>>>>> > > >>>>>>>>>>>>> > >>>>>>>> > > >>>>>>>>>>>>>> Hi Stephan, > >>>>>>>> > > >>>>>>>>>>>>>> > >>>>>>>> > > >>>>>>>>>>>>>> Thanks a lot for the feedback. All makes sense. > >>>>>>>> > > >>>>>>>>>>>>>> > >>>>>>>> > > >>>>>>>>>>>>>> It is a good suggestion to simply have an > >>>>>>>> > onRecord(numBytes, > >>>>>>>> > > >>>>>>>>> eventTime) > >>>>>>>> > > >>>>>>>>>>>>>> method for connector writers. It should meet > >>>>>>>> most of the > >>>>>>>> > > >>>>>>>>> requirements, > >>>>>>>> > > >>>>>>>>>>>>>> individual > >>>>>>>> > > >>>>>>>>>>>>>> > >>>>>>>> > > >>>>>>>>>>>>>> The configurable metrics feature is something > >>>>>>>> really > >>>>>>>> > useful, > >>>>>>>> > > >>>>>>>>>>>> especially if > >>>>>>>> > > >>>>>>>>>>>>>> we can somehow make it dynamically configurable > >>>>>>>> without > >>>>>>>> > > >> stopping > >>>>>>>> > > >>>>>>> the > >>>>>>>> > > >>>>>>>>>>>> jobs. > >>>>>>>> > > >>>>>>>>>>>>>> It might be better to make it a separate > >>>>>>>> discussion > >>>>>>>> > because > >>>>>>>> > > it > >>>>>>>> > > >>>>> is a > >>>>>>>> > > >>>>>>>>>>>> more > >>>>>>>> > > >>>>>>>>>>>>>> generic feature instead of only for connectors. > >>>>>>>> > > >>>>>>>>>>>>>> > >>>>>>>> > > >>>>>>>>>>>>>> So in order to make some progress, in this FLIP > >>>>>>>> we can > >>>>>>>> > limit > >>>>>>>> > > >> the > >>>>>>>> > > >>>>>>>>>>>> discussion > >>>>>>>> > > >>>>>>>>>>>>>> scope to the connector related items: > >>>>>>>> > > >>>>>>>>>>>>>> > >>>>>>>> > > >>>>>>>>>>>>>> - the standard connector metric names and > types. > >>>>>>>> > > >>>>>>>>>>>>>> - the abstract ConnectorMetricHandler interface > >>>>>>>> > > >>>>>>>>>>>>>> > >>>>>>>> > > >>>>>>>>>>>>>> I'll start a separate thread to discuss other > >>>>>>>> general > >>>>>>>> > metric > >>>>>>>> > > >>>>>>> related > >>>>>>>> > > >>>>>>>>>>>>>> enhancement items including: > >>>>>>>> > > >>>>>>>>>>>>>> > >>>>>>>> > > >>>>>>>>>>>>>> - optional metrics > >>>>>>>> > > >>>>>>>>>>>>>> - dynamic metric configuration > >>>>>>>> > > >>>>>>>>>>>>>> - potential combination with rate limiter > >>>>>>>> > > >>>>>>>>>>>>>> > >>>>>>>> > > >>>>>>>>>>>>>> Does this plan sound reasonable? > >>>>>>>> > > >>>>>>>>>>>>>> > >>>>>>>> > > >>>>>>>>>>>>>> Thanks, > >>>>>>>> > > >>>>>>>>>>>>>> > >>>>>>>> > > >>>>>>>>>>>>>> Jiangjie (Becket) Qin > >>>>>>>> > > >>>>>>>>>>>>>> > >>>>>>>> > > >>>>>>>>>>>>>> On Sat, Mar 23, 2019 at 5:53 AM Stephan Ewen < > >>>>>>>> > > >> se...@apache.org> > >>>>>>>> > > >>>>>>>>> wrote: > >>>>>>>> > > >>>>>>>>>>>>>> > >>>>>>>> > > >>>>>>>>>>>>>>> Ignoring for a moment implementation details, > >>>>>>>> this > >>>>>>>> > > connector > >>>>>>>> > > >>>>>>> metrics > >>>>>>>> > > >>>>>>>>>>>> work > >>>>>>>> > > >>>>>>>>>>>>>>> is a really good thing to do, in my opinion > >>>>>>>> > > >>>>>>>>>>>>>>> > >>>>>>>> > > >>>>>>>>>>>>>>> The questions "oh, my job seems to be doing > >>>>>>>> nothing, I am > >>>>>>>> > > >>>>> looking > >>>>>>>> > > >>>>>>> at > >>>>>>>> > > >>>>>>>>>>>> the > >>>>>>>> > > >>>>>>>>>>>>>> UI > >>>>>>>> > > >>>>>>>>>>>>>>> and the 'records in' value is still zero" is > in > >>>>>>>> the top > >>>>>>>> > > three > >>>>>>>> > > >>>>>>>>> support > >>>>>>>> > > >>>>>>>>>>>>>>> questions I have been asked personally. > >>>>>>>> > > >>>>>>>>>>>>>>> Introspection into "how far is the consumer > >>>>>>>> lagging > >>>>>>>> > behind" > >>>>>>>> > > >>>>> (event > >>>>>>>> > > >>>>>>>>>>>> time > >>>>>>>> > > >>>>>>>>>>>>>>> fetch latency) came up many times as well. > >>>>>>>> > > >>>>>>>>>>>>>>> > >>>>>>>> > > >>>>>>>>>>>>>>> So big +1 to solving this problem. > >>>>>>>> > > >>>>>>>>>>>>>>> > >>>>>>>> > > >>>>>>>>>>>>>>> About the exact design - I would try to go for > >>>>>>>> the > >>>>>>>> > > following > >>>>>>>> > > >>>>>>>>>>>> properties: > >>>>>>>> > > >>>>>>>>>>>>>>> > >>>>>>>> > > >>>>>>>>>>>>>>> - keep complexity of of connectors. Ideally > the > >>>>>>>> metrics > >>>>>>>> > > >> handler > >>>>>>>> > > >>>>>>> has > >>>>>>>> > > >>>>>>>>> a > >>>>>>>> > > >>>>>>>>>>>>>>> single onRecord(numBytes, eventTime) method or > >>>>>>>> so, and > >>>>>>>> > > >>>>> everything > >>>>>>>> > > >>>>>>>>>>>> else is > >>>>>>>> > > >>>>>>>>>>>>>>> internal to the handler. That makes it dead > >>>>>>>> simple for > >>>>>>>> > the > >>>>>>>> > > >>>>>>>>> connector. > >>>>>>>> > > >>>>>>>>>>>> We > >>>>>>>> > > >>>>>>>>>>>>>>> can also think of an extensive scheme for > >>>>>>>> connector > >>>>>>>> > > specific > >>>>>>>> > > >>>>>>>>> metrics. > >>>>>>>> > > >>>>>>>>>>>>>>> > >>>>>>>> > > >>>>>>>>>>>>>>> - make it configurable on the job it cluster > >>>>>>>> level which > >>>>>>>> > > >>>>> metrics > >>>>>>>> > > >>>>>>> the > >>>>>>>> > > >>>>>>>>>>>>>>> handler internally creates when that method is > >>>>>>>> invoked. > >>>>>>>> > > >>>>>>>>>>>>>>> > >>>>>>>> > > >>>>>>>>>>>>>>> What do you think? > >>>>>>>> > > >>>>>>>>>>>>>>> > >>>>>>>> > > >>>>>>>>>>>>>>> Best, > >>>>>>>> > > >>>>>>>>>>>>>>> Stephan > >>>>>>>> > > >>>>>>>>>>>>>>> > >>>>>>>> > > >>>>>>>>>>>>>>> > >>>>>>>> > > >>>>>>>>>>>>>>> On Thu, Mar 21, 2019 at 10:42 AM Chesnay > >>>>>>>> Schepler < > >>>>>>>> > > >>>>>>>>> ches...@apache.org > >>>>>>>> > > >>>>>>>>>>>>> > >>>>>>>> > > >>>>>>>>>>>>>>> wrote: > >>>>>>>> > > >>>>>>>>>>>>>>> > >>>>>>>> > > >>>>>>>>>>>>>>>> As I said before, I believe this to be > >>>>>>>> over-engineered > >>>>>>>> > and > >>>>>>>> > > >>>>> have > >>>>>>>> > > >>>>>>> no > >>>>>>>> > > >>>>>>>>>>>>>>>> interest in this implementation. > >>>>>>>> > > >>>>>>>>>>>>>>>> > >>>>>>>> > > >>>>>>>>>>>>>>>> There are conceptual issues like defining a > >>>>>>>> duplicate > >>>>>>>> > > >>>>>>>>>>>>>> numBytesIn(PerSec) > >>>>>>>> > > >>>>>>>>>>>>>>>> metric that already exists for each operator. > >>>>>>>> > > >>>>>>>>>>>>>>>> > >>>>>>>> > > >>>>>>>>>>>>>>>> On 21.03.2019 06:13, Becket Qin wrote: > >>>>>>>> > > >>>>>>>>>>>>>>>>> A few updates to the thread. I uploaded a > >>>>>>>> patch[1] as a > >>>>>>>> > > >>>>> complete > >>>>>>>> > > >>>>>>>>>>>>>>>>> example of how users can use the metrics in > >>>>>>>> the future. > >>>>>>>> > > >>>>>>>>>>>>>>>>> > >>>>>>>> > > >>>>>>>>>>>>>>>>> Some thoughts below after taking a look at > the > >>>>>>>> > > >>>>>>> AbstractMetricGroup > >>>>>>>> > > >>>>>>>>>>>>>> and > >>>>>>>> > > >>>>>>>>>>>>>>>>> its subclasses. > >>>>>>>> > > >>>>>>>>>>>>>>>>> > >>>>>>>> > > >>>>>>>>>>>>>>>>> This patch intends to provide convenience > for > >>>>>>>> Flink > >>>>>>>> > > >> connector > >>>>>>>> > > >>>>>>>>>>>>>>>>> implementations to follow metrics standards > >>>>>>>> proposed in > >>>>>>>> > > >>>>> FLIP-33. > >>>>>>>> > > >>>>>>>>> It > >>>>>>>> > > >>>>>>>>>>>>>>>>> also try to enhance the metric management in > >>>>>>>> general > >>>>>>>> > way > >>>>>>>> > > to > >>>>>>>> > > >>>>> help > >>>>>>>> > > >>>>>>>>>>>>>> users > >>>>>>>> > > >>>>>>>>>>>>>>>>> with: > >>>>>>>> > > >>>>>>>>>>>>>>>>> > >>>>>>>> > > >>>>>>>>>>>>>>>>> 1. metric definition > >>>>>>>> > > >>>>>>>>>>>>>>>>> 2. metric dependencies check > >>>>>>>> > > >>>>>>>>>>>>>>>>> 3. metric validation > >>>>>>>> > > >>>>>>>>>>>>>>>>> 4. metric control (turn on / off particular > >>>>>>>> metrics) > >>>>>>>> > > >>>>>>>>>>>>>>>>> > >>>>>>>> > > >>>>>>>>>>>>>>>>> This patch wraps |MetricGroup| to extend the > >>>>>>>> > > functionality > >>>>>>>> > > >> of > >>>>>>>> > > >>>>>>>>>>>>>>>>> |AbstractMetricGroup| and its subclasses. > The > >>>>>>>> > > >>>>>>>>>>>>>>>>> |AbstractMetricGroup| mainly focus on the > >>>>>>>> metric group > >>>>>>>> > > >>>>>>> hierarchy, > >>>>>>>> > > >>>>>>>>>>>> but > >>>>>>>> > > >>>>>>>>>>>>>>>>> does not really manage the metrics other > than > >>>>>>>> keeping > >>>>>>>> > > them > >>>>>>>> > > >>>>> in a > >>>>>>>> > > >>>>>>>>> Map. > >>>>>>>> > > >>>>>>>>>>>>>>>>> > >>>>>>>> > > >>>>>>>>>>>>>>>>> Ideally we should only have one entry point > >>>>>>>> for the > >>>>>>>> > > >> metrics. > >>>>>>>> > > >>>>>>>>>>>>>>>>> > >>>>>>>> > > >>>>>>>>>>>>>>>>> Right now the entry point is > >>>>>>>> |AbstractMetricGroup|. > >>>>>>>> > > >> However, > >>>>>>>> > > >>>>>>>>> besides > >>>>>>>> > > >>>>>>>>>>>>>>>>> the missing functionality mentioned above, > >>>>>>>> > > >>>>> |AbstractMetricGroup| > >>>>>>>> > > >>>>>>>>>>>>>> seems > >>>>>>>> > > >>>>>>>>>>>>>>>>> deeply rooted in Flink runtime. We could > >>>>>>>> extract it out > >>>>>>>> > > to > >>>>>>>> > > >>>>>>>>>>>>>>>>> flink-metrics in order to use it for generic > >>>>>>>> purpose. > >>>>>>>> > > There > >>>>>>>> > > >>>>> will > >>>>>>>> > > >>>>>>>>> be > >>>>>>>> > > >>>>>>>>>>>>>>>>> some work, though. > >>>>>>>> > > >>>>>>>>>>>>>>>>> > >>>>>>>> > > >>>>>>>>>>>>>>>>> Another approach is to make > |AbstractMetrics| > >>>>>>>> in this > >>>>>>>> > > patch > >>>>>>>> > > >>>>> as > >>>>>>>> > > >>>>>>> the > >>>>>>>> > > >>>>>>>>>>>>>>>>> metric entry point. It wraps metric group > and > >>>>>>>> provides > >>>>>>>> > > the > >>>>>>>> > > >>>>>>> missing > >>>>>>>> > > >>>>>>>>>>>>>>>>> functionalities. Then we can roll out this > >>>>>>>> pattern to > >>>>>>>> > > >> runtime > >>>>>>>> > > >>>>>>>>>>>>>>>>> components gradually as well. > >>>>>>>> > > >>>>>>>>>>>>>>>>> > >>>>>>>> > > >>>>>>>>>>>>>>>>> My first thought is that the latter approach > >>>>>>>> gives a > >>>>>>>> > more > >>>>>>>> > > >>>>> smooth > >>>>>>>> > > >>>>>>>>>>>>>>>>> migration. But I am also OK with doing a > >>>>>>>> refactoring on > >>>>>>>> > > the > >>>>>>>> > > >>>>>>>>>>>>>>>>> |AbstractMetricGroup| family. > >>>>>>>> > > >>>>>>>>>>>>>>>>> > >>>>>>>> > > >>>>>>>>>>>>>>>>> > >>>>>>>> > > >>>>>>>>>>>>>>>>> Thanks, > >>>>>>>> > > >>>>>>>>>>>>>>>>> > >>>>>>>> > > >>>>>>>>>>>>>>>>> Jiangjie (Becket) Qin > >>>>>>>> > > >>>>>>>>>>>>>>>>> > >>>>>>>> > > >>>>>>>>>>>>>>>>> [1] > https://github.com/becketqin/flink/pull/1 > >>>>>>>> > > >>>>>>>>>>>>>>>>> > >>>>>>>> > > >>>>>>>>>>>>>>>>> On Mon, Feb 25, 2019 at 2:32 PM Becket Qin < > >>>>>>>> > > >>>>>>> becket....@gmail.com > >>>>>>>> > > >>>>>>>>>>>>>>>>> <mailto:becket....@gmail.com>> wrote: > >>>>>>>> > > >>>>>>>>>>>>>>>>> > >>>>>>>> > > >>>>>>>>>>>>>>>>> Hi Chesnay, > >>>>>>>> > > >>>>>>>>>>>>>>>>> > >>>>>>>> > > >>>>>>>>>>>>>>>>> It might be easier to discuss some > >>>>>>>> implementation > >>>>>>>> > details > >>>>>>>> > > >> in > >>>>>>>> > > >>>>>>> the > >>>>>>>> > > >>>>>>>>>>>>>>>>> PR review instead of in the FLIP discussion > >>>>>>>> thread. I > >>>>>>>> > > have > >>>>>>>> > > >> a > >>>>>>>> > > >>>>>>>>>>>>>> patch > >>>>>>>> > > >>>>>>>>>>>>>>>>> for Kafka connectors ready but haven't > >>>>>>>> submitted the PR > >>>>>>>> > > >> yet. > >>>>>>>> > > >>>>>>>>>>>>>>>>> Hopefully that will help explain a bit more. > >>>>>>>> > > >>>>>>>>>>>>>>>>> > >>>>>>>> > > >>>>>>>>>>>>>>>>> ** Re: metric type binding > >>>>>>>> > > >>>>>>>>>>>>>>>>> This is a valid point that worths > discussing. > >>>>>>>> If I > >>>>>>>> > > >> understand > >>>>>>>> > > >>>>>>>>>>>>>>>>> correctly, there are two points: > >>>>>>>> > > >>>>>>>>>>>>>>>>> > >>>>>>>> > > >>>>>>>>>>>>>>>>> 1. Metric type / interface does not matter > as > >>>>>>>> long as > >>>>>>>> > the > >>>>>>>> > > >>>>>>> metric > >>>>>>>> > > >>>>>>>>>>>>>>>>> semantic is clearly defined. > >>>>>>>> > > >>>>>>>>>>>>>>>>> Conceptually speaking, I agree that as long > >>>>>>>> as the > >>>>>>>> > metric > >>>>>>>> > > >>>>>>>>>>>>>> semantic > >>>>>>>> > > >>>>>>>>>>>>>>>>> is defined, metric type does not matter. To > >>>>>>>> some > >>>>>>>> > extent, > >>>>>>>> > > >>>>> Gauge > >>>>>>>> > > >>>>>>> / > >>>>>>>> > > >>>>>>>>>>>>>>>>> Counter / Meter / Histogram themselves can > be > >>>>>>>> think of > >>>>>>>> > as > >>>>>>>> > > >>>>> some > >>>>>>>> > > >>>>>>>>>>>>>>>>> well-recognized semantics, if you wish. In > >>>>>>>> Flink, these > >>>>>>>> > > >>>>> metric > >>>>>>>> > > >>>>>>>>>>>>>>>>> semantics have their associated interface > >>>>>>>> classes. In > >>>>>>>> > > >>>>> practice, > >>>>>>>> > > >>>>>>>>>>>>>>>>> such semantic to interface binding seems > >>>>>>>> necessary for > >>>>>>>> > > >>>>>>> different > >>>>>>>> > > >>>>>>>>>>>>>>>>> components to communicate. Simply > >>>>>>>> standardize the > >>>>>>>> > > semantic > >>>>>>>> > > >>>>> of > >>>>>>>> > > >>>>>>>>>>>>>> the > >>>>>>>> > > >>>>>>>>>>>>>>>>> connector metrics seems not sufficient for > >>>>>>>> people to > >>>>>>>> > > build > >>>>>>>> > > >>>>>>>>>>>>>>>>> ecosystem on top of. At the end of the day, > >>>>>>>> we still > >>>>>>>> > need > >>>>>>>> > > >> to > >>>>>>>> > > >>>>>>>>> have > >>>>>>>> > > >>>>>>>>>>>>>>>>> some embodiment of the metric semantics that > >>>>>>>> people can > >>>>>>>> > > >>>>> program > >>>>>>>> > > >>>>>>>>>>>>>>>>> against. > >>>>>>>> > > >>>>>>>>>>>>>>>>> > >>>>>>>> > > >>>>>>>>>>>>>>>>> 2. Sometimes the same metric semantic can be > >>>>>>>> exposed > >>>>>>>> > > using > >>>>>>>> > > >>>>>>>>>>>>>>>>> different metric types / interfaces. > >>>>>>>> > > >>>>>>>>>>>>>>>>> This is a good point. Counter and > >>>>>>>> Gauge-as-a-Counter > >>>>>>>> > are > >>>>>>>> > > >>>>> pretty > >>>>>>>> > > >>>>>>>>>>>>>>>>> much interchangeable. This is more of a > >>>>>>>> trade-off > >>>>>>>> > between > >>>>>>>> > > >> the > >>>>>>>> > > >>>>>>>>>>>>>> user > >>>>>>>> > > >>>>>>>>>>>>>>>>> experience of metric producers and > consumers. > >>>>>>>> The > >>>>>>>> > metric > >>>>>>>> > > >>>>>>>>>>>>>> producers > >>>>>>>> > > >>>>>>>>>>>>>>>>> want to use Counter or Gauge depending on > >>>>>>>> whether the > >>>>>>>> > > >> counter > >>>>>>>> > > >>>>>>> is > >>>>>>>> > > >>>>>>>>>>>>>>>>> already tracked in code, while ideally the > >>>>>>>> metric > >>>>>>>> > > consumers > >>>>>>>> > > >>>>>>> only > >>>>>>>> > > >>>>>>>>>>>>>>>>> want to see a single metric type for each > >>>>>>>> metric. I am > >>>>>>>> > > >>>>> leaning > >>>>>>>> > > >>>>>>>>>>>>>>>>> towards to make the metric producers happy, > >>>>>>>> i.e. allow > >>>>>>>> > > >> Gauge > >>>>>>>> > > >>>>> / > >>>>>>>> > > >>>>>>>>>>>>>>>>> Counter metric type, and the the metric > >>>>>>>> consumers > >>>>>>>> > handle > >>>>>>>> > > >> the > >>>>>>>> > > >>>>>>>>> type > >>>>>>>> > > >>>>>>>>>>>>>>>>> variation. The reason is that in practice, > >>>>>>>> there might > >>>>>>>> > be > >>>>>>>> > > >>>>> more > >>>>>>>> > > >>>>>>>>>>>>>>>>> connector implementations than metric > reporter > >>>>>>>> > > >>>>> implementations. > >>>>>>>> > > >>>>>>>>>>>>>> We > >>>>>>>> > > >>>>>>>>>>>>>>>>> could also provide some helper method to > >>>>>>>> facilitate > >>>>>>>> > > reading > >>>>>>>> > > >>>>>>> from > >>>>>>>> > > >>>>>>>>>>>>>>>>> such variable metric type. > >>>>>>>> > > >>>>>>>>>>>>>>>>> > >>>>>>>> > > >>>>>>>>>>>>>>>>> > >>>>>>>> > > >>>>>>>>>>>>>>>>> Just some quick replies to the comments > around > >>>>>>>> > > >> implementation > >>>>>>>> > > >>>>>>>>>>>>>>>> details. > >>>>>>>> > > >>>>>>>>>>>>>>>>> > >>>>>>>> > > >>>>>>>>>>>>>>>>> 4) single place where metrics are > >>>>>>>> registered except > >>>>>>>> > > >>>>>>>>>>>>>>>>> connector-specific > >>>>>>>> > > >>>>>>>>>>>>>>>>> ones (which we can't really avoid). > >>>>>>>> > > >>>>>>>>>>>>>>>>> > >>>>>>>> > > >>>>>>>>>>>>>>>>> Register connector specific ones in a single > >>>>>>>> place is > >>>>>>>> > > >>>>> actually > >>>>>>>> > > >>>>>>>>>>>>>>>>> something that I want to achieve. > >>>>>>>> > > >>>>>>>>>>>>>>>>> > >>>>>>>> > > >>>>>>>>>>>>>>>>> 2) I'm talking about time-series databases > >>>>>>>> like > >>>>>>>> > > >>>>> Prometheus. > >>>>>>>> > > >>>>>>>>>>>>>> We > >>>>>>>> > > >>>>>>>>>>>>>>>>> would > >>>>>>>> > > >>>>>>>>>>>>>>>>> only have a gauge metric exposing the last > >>>>>>>> > > >>>>>>>>> fetchTime/emitTime > >>>>>>>> > > >>>>>>>>>>>>>>>>> that is > >>>>>>>> > > >>>>>>>>>>>>>>>>> regularly reported to the backend > >>>>>>>> (Prometheus), > >>>>>>>> > where a > >>>>>>>> > > >>>>>>> user > >>>>>>>> > > >>>>>>>>>>>>>>>>> could build > >>>>>>>> > > >>>>>>>>>>>>>>>>> a histogram of his choosing when/if he > >>>>>>>> wants it. > >>>>>>>> > > >>>>>>>>>>>>>>>>> > >>>>>>>> > > >>>>>>>>>>>>>>>>> Not sure if such downsampling works. As an > >>>>>>>> example, if > >>>>>>>> > a > >>>>>>>> > > >> user > >>>>>>>> > > >>>>>>>>>>>>>>>>> complains that there are some intermittent > >>>>>>>> latency > >>>>>>>> > spikes > >>>>>>>> > > >>>>>>> (maybe > >>>>>>>> > > >>>>>>>>>>>>>> a > >>>>>>>> > > >>>>>>>>>>>>>>>>> few records in 10 seconds) in their > >>>>>>>> processing system. > >>>>>>>> > > >>>>> Having a > >>>>>>>> > > >>>>>>>>>>>>>>>>> Gauge sampling instantaneous latency seems > >>>>>>>> unlikely > >>>>>>>> > > useful. > >>>>>>>> > > >>>>>>>>>>>>>>>>> However by looking at actual 99.9 percentile > >>>>>>>> latency > >>>>>>>> > > might > >>>>>>>> > > >>>>>>> help. > >>>>>>>> > > >>>>>>>>>>>>>>>>> > >>>>>>>> > > >>>>>>>>>>>>>>>>> Thanks, > >>>>>>>> > > >>>>>>>>>>>>>>>>> > >>>>>>>> > > >>>>>>>>>>>>>>>>> Jiangjie (Becket) Qin > >>>>>>>> > > >>>>>>>>>>>>>>>>> > >>>>>>>> > > >>>>>>>>>>>>>>>>> > >>>>>>>> > > >>>>>>>>>>>>>>>>> On Fri, Feb 22, 2019 at 9:30 PM Chesnay > >>>>>>>> Schepler > >>>>>>>> > > >>>>>>>>>>>>>>>>> <ches...@apache.org <mailto: > >>>>>>>> ches...@apache.org>> > >>>>>>>> > wrote: > >>>>>>>> > > >>>>>>>>>>>>>>>>> > >>>>>>>> > > >>>>>>>>>>>>>>>>> Re: over complication of implementation. > >>>>>>>> > > >>>>>>>>>>>>>>>>> > >>>>>>>> > > >>>>>>>>>>>>>>>>> I think I get understand better know what > >>>>>>>> you're > >>>>>>>> > > >> shooting > >>>>>>>> > > >>>>>>>>>>>>>> for, > >>>>>>>> > > >>>>>>>>>>>>>>>>> effectively something like the > >>>>>>>> OperatorIOMetricGroup. > >>>>>>>> > > >>>>>>>>>>>>>>>>> But still, re-define > >>>>>>>> setupConnectorMetrics() to > >>>>>>>> > accept > >>>>>>>> > > a > >>>>>>>> > > >>>>>>> set > >>>>>>>> > > >>>>>>>>>>>>>>>>> of flags > >>>>>>>> > > >>>>>>>>>>>>>>>>> for counters/meters(ans _possibly_ > >>>>>>>> histograms) along > >>>>>>>> > > >>>>> with a > >>>>>>>> > > >>>>>>>>>>>>>>>>> set of > >>>>>>>> > > >>>>>>>>>>>>>>>>> well-defined Optional<Gauge<?>>, and > return > >>>>>>>> the > >>>>>>>> > group. > >>>>>>>> > > >>>>>>>>>>>>>>>>> > >>>>>>>> > > >>>>>>>>>>>>>>>>> Solves all issues as far as i can tell: > >>>>>>>> > > >>>>>>>>>>>>>>>>> 1) no metrics must be created manually > >>>>>>>> (except > >>>>>>>> > Gauges, > >>>>>>>> > > >>>>>>> which > >>>>>>>> > > >>>>>>>>>>>>>>> are > >>>>>>>> > > >>>>>>>>>>>>>>>>> effectively just Suppliers and you can't > >>>>>>>> get around > >>>>>>>> > > >>>>> this), > >>>>>>>> > > >>>>>>>>>>>>>>>>> 2) additional metrics can be registered on > >>>>>>>> the > >>>>>>>> > returned > >>>>>>>> > > >>>>>>>>>>>>>> group, > >>>>>>>> > > >>>>>>>>>>>>>>>>> 3) see 1), > >>>>>>>> > > >>>>>>>>>>>>>>>>> 4) single place where metrics are > >>>>>>>> registered except > >>>>>>>> > > >>>>>>>>>>>>>>>>> connector-specific > >>>>>>>> > > >>>>>>>>>>>>>>>>> ones (which we can't really avoid). > >>>>>>>> > > >>>>>>>>>>>>>>>>> > >>>>>>>> > > >>>>>>>>>>>>>>>>> Re: Histogram > >>>>>>>> > > >>>>>>>>>>>>>>>>> > >>>>>>>> > > >>>>>>>>>>>>>>>>> 1) As an example, whether "numRecordsIn" > is > >>>>>>>> exposed > >>>>>>>> > as > >>>>>>>> > > a > >>>>>>>> > > >>>>>>>>>>>>>>>>> Counter or a > >>>>>>>> > > >>>>>>>>>>>>>>>>> Gauge should be irrelevant. So far we're > >>>>>>>> using the > >>>>>>>> > > >> metric > >>>>>>>> > > >>>>>>>>>>>>>> type > >>>>>>>> > > >>>>>>>>>>>>>>>>> that is > >>>>>>>> > > >>>>>>>>>>>>>>>>> the most convenient at exposing a given > >>>>>>>> value. If > >>>>>>>> > there > >>>>>>>> > > >>>>> is > >>>>>>>> > > >>>>>>>>>>>>>>>>> some backing > >>>>>>>> > > >>>>>>>>>>>>>>>>> data-structure that we want to expose some > >>>>>>>> data from > >>>>>>>> > we > >>>>>>>> > > >>>>>>>>>>>>>>>>> typically opt > >>>>>>>> > > >>>>>>>>>>>>>>>>> for a Gauge, as otherwise we're just > >>>>>>>> mucking around > >>>>>>>> > > with > >>>>>>>> > > >>>>>>> the > >>>>>>>> > > >>>>>>>>>>>>>>>>> Meter/Counter API to get it to match. > >>>>>>>> Similarly, if > >>>>>>>> > we > >>>>>>>> > > >>>>> want > >>>>>>>> > > >>>>>>>>>>>>>> to > >>>>>>>> > > >>>>>>>>>>>>>>>>> count > >>>>>>>> > > >>>>>>>>>>>>>>>>> something but no current count exists we > >>>>>>>> typically > >>>>>>>> > > added > >>>>>>>> > > >>>>> a > >>>>>>>> > > >>>>>>>>>>>>>>>>> Counter. > >>>>>>>> > > >>>>>>>>>>>>>>>>> That's why attaching semantics to metric > >>>>>>>> types makes > >>>>>>>> > > >>>>> little > >>>>>>>> > > >>>>>>>>>>>>>>>>> sense (but > >>>>>>>> > > >>>>>>>>>>>>>>>>> unfortunately several reporters already do > >>>>>>>> it); for > >>>>>>>> > > >>>>>>>>>>>>>>>>> counters/meters > >>>>>>>> > > >>>>>>>>>>>>>>>>> certainly, but the majority of metrics are > >>>>>>>> gauges. > >>>>>>>> > > >>>>>>>>>>>>>>>>> > >>>>>>>> > > >>>>>>>>>>>>>>>>> 2) I'm talking about time-series databases > >>>>>>>> like > >>>>>>>> > > >>>>> Prometheus. > >>>>>>>> > > >>>>>>>>>>>>>> We > >>>>>>>> > > >>>>>>>>>>>>>>>>> would > >>>>>>>> > > >>>>>>>>>>>>>>>>> only have a gauge metric exposing the last > >>>>>>>> > > >>>>>>>>> fetchTime/emitTime > >>>>>>>> > > >>>>>>>>>>>>>>>>> that is > >>>>>>>> > > >>>>>>>>>>>>>>>>> regularly reported to the backend > >>>>>>>> (Prometheus), > >>>>>>>> > where a > >>>>>>>> > > >>>>>>> user > >>>>>>>> > > >>>>>>>>>>>>>>>>> could build > >>>>>>>> > > >>>>>>>>>>>>>>>>> a histogram of his choosing when/if he > >>>>>>>> wants it. > >>>>>>>> > > >>>>>>>>>>>>>>>>> > >>>>>>>> > > >>>>>>>>>>>>>>>>> On 22.02.2019 13:57, Becket Qin wrote: > >>>>>>>> > > >>>>>>>>>>>>>>>>>> Hi Chesnay, > >>>>>>>> > > >>>>>>>>>>>>>>>>>> > >>>>>>>> > > >>>>>>>>>>>>>>>>>> Thanks for the explanation. > >>>>>>>> > > >>>>>>>>>>>>>>>>>> > >>>>>>>> > > >>>>>>>>>>>>>>>>>> ** Re: FLIP > >>>>>>>> > > >>>>>>>>>>>>>>>>>> I might have misunderstood this, but it > >>>>>>>> seems that > >>>>>>>> > > "major > >>>>>>>> > > >>>>>>>>>>>>>>>>> changes" are well > >>>>>>>> > > >>>>>>>>>>>>>>>>>> defined in FLIP. The full contents is > >>>>>>>> following: > >>>>>>>> > > >>>>>>>>>>>>>>>>>> What is considered a "major change" that > >>>>>>>> needs a FLIP? > >>>>>>>> > > >>>>>>>>>>>>>>>>>> > >>>>>>>> > > >>>>>>>>>>>>>>>>>> Any of the following should be considered a > >>>>>>>> major > >>>>>>>> > > change: > >>>>>>>> > > >>>>>>>>>>>>>>>>>> > >>>>>>>> > > >>>>>>>>>>>>>>>>>> - Any major new feature, subsystem, or > piece > >>>>>>>> of > >>>>>>>> > > >>>>>>>>>>>>>>>>> functionality > >>>>>>>> > > >>>>>>>>>>>>>>>>>> - *Any change that impacts the public > >>>>>>>> interfaces of > >>>>>>>> > the > >>>>>>>> > > >>>>>>>>>>>>>>>>> project* > >>>>>>>> > > >>>>>>>>>>>>>>>>>> > >>>>>>>> > > >>>>>>>>>>>>>>>>>> What are the "public interfaces" of the > >>>>>>>> project? > >>>>>>>> > > >>>>>>>>>>>>>>>>>> > >>>>>>>> > > >>>>>>>>>>>>>>>>>> > >>>>>>>> > > >>>>>>>>>>>>>>>>>> > >>>>>>>> > > >>>>>>>>>>>>>>>>>> *All of the following are public interfaces > >>>>>>>> *that > >>>>>>>> > people > >>>>>>>> > > >>>>>>>>>>>>>>>>> build around: > >>>>>>>> > > >>>>>>>>>>>>>>>>>> > >>>>>>>> > > >>>>>>>>>>>>>>>>>> - DataStream and DataSet API, including > >>>>>>>> classes > >>>>>>>> > related > >>>>>>>> > > >>>>>>>>>>>>>>>>> to that, such as > >>>>>>>> > > >>>>>>>>>>>>>>>>>> StreamExecutionEnvironment > >>>>>>>> > > >>>>>>>>>>>>>>>>>> > >>>>>>>> > > >>>>>>>>>>>>>>>>>> > >>>>>>>> > > >>>>>>>>>>>>>>>>>> - Classes marked with the @Public > annotation > >>>>>>>> > > >>>>>>>>>>>>>>>>>> > >>>>>>>> > > >>>>>>>>>>>>>>>>>> > >>>>>>>> > > >>>>>>>>>>>>>>>>>> - On-disk binary formats, such as > >>>>>>>> > > >>>>>>>>>>>>>> checkpoints/savepoints > >>>>>>>> > > >>>>>>>>>>>>>>>>>> > >>>>>>>> > > >>>>>>>>>>>>>>>>>> > >>>>>>>> > > >>>>>>>>>>>>>>>>>> - User-facing scripts/command-line tools, > >>>>>>>> i.e. > >>>>>>>> > > >>>>>>>>>>>>>>>>> bin/flink, Yarn scripts, > >>>>>>>> > > >>>>>>>>>>>>>>>>>> Mesos scripts > >>>>>>>> > > >>>>>>>>>>>>>>>>>> > >>>>>>>> > > >>>>>>>>>>>>>>>>>> > >>>>>>>> > > >>>>>>>>>>>>>>>>>> - Configuration settings > >>>>>>>> > > >>>>>>>>>>>>>>>>>> > >>>>>>>> > > >>>>>>>>>>>>>>>>>> > >>>>>>>> > > >>>>>>>>>>>>>>>>>> - *Exposed monitoring information* > >>>>>>>> > > >>>>>>>>>>>>>>>>>> > >>>>>>>> > > >>>>>>>>>>>>>>>>>> > >>>>>>>> > > >>>>>>>>>>>>>>>>>> So any monitoring information change is > >>>>>>>> considered as > >>>>>>>> > > >>>>>>>>>>>>>> public > >>>>>>>> > > >>>>>>>>>>>>>>>>> interface, and > >>>>>>>> > > >>>>>>>>>>>>>>>>>> any public interface change is considered > as > >>>>>>>> a "major > >>>>>>>> > > >>>>>>>>>>>>>>> change". > >>>>>>>> > > >>>>>>>>>>>>>>>>>> > >>>>>>>> > > >>>>>>>>>>>>>>>>>> > >>>>>>>> > > >>>>>>>>>>>>>>>>>> ** Re: over complication of implementation. > >>>>>>>> > > >>>>>>>>>>>>>>>>>> > >>>>>>>> > > >>>>>>>>>>>>>>>>>> Although this is more of implementation > >>>>>>>> details that > >>>>>>>> > is > >>>>>>>> > > >> not > >>>>>>>> > > >>>>>>>>>>>>>>>>> covered by the > >>>>>>>> > > >>>>>>>>>>>>>>>>>> FLIP. But it may be worth discussing. > >>>>>>>> > > >>>>>>>>>>>>>>>>>> > >>>>>>>> > > >>>>>>>>>>>>>>>>>> First of all, I completely agree that we > >>>>>>>> should use > >>>>>>>> > the > >>>>>>>> > > >>>>>>>>>>>>>>>>> simplest way to > >>>>>>>> > > >>>>>>>>>>>>>>>>>> achieve our goal. To me the goal is the > >>>>>>>> following: > >>>>>>>> > > >>>>>>>>>>>>>>>>>> 1. Clear connector conventions and > >>>>>>>> interfaces. > >>>>>>>> > > >>>>>>>>>>>>>>>>>> 2. The easiness of creating a connector. > >>>>>>>> > > >>>>>>>>>>>>>>>>>> > >>>>>>>> > > >>>>>>>>>>>>>>>>>> Both of them are important to the > prosperity > >>>>>>>> of the > >>>>>>>> > > >>>>>>>>>>>>>>>>> connector ecosystem. So > >>>>>>>> > > >>>>>>>>>>>>>>>>>> I'd rather abstract as much as possible on > >>>>>>>> our side to > >>>>>>>> > > >> make > >>>>>>>> > > >>>>>>>>>>>>>>>>> the connector > >>>>>>>> > > >>>>>>>>>>>>>>>>>> developer's work lighter. Given this goal, > a > >>>>>>>> static > >>>>>>>> > util > >>>>>>>> > > >>>>>>>>>>>>>>>>> method approach > >>>>>>>> > > >>>>>>>>>>>>>>>>>> might have a few drawbacks: > >>>>>>>> > > >>>>>>>>>>>>>>>>>> 1. Users still have to construct the > metrics > >>>>>>>> by > >>>>>>>> > > >> themselves. > >>>>>>>> > > >>>>>>>>>>>>>>>>> (And note that > >>>>>>>> > > >>>>>>>>>>>>>>>>>> this might be erroneous by itself. For > >>>>>>>> example, a > >>>>>>>> > > customer > >>>>>>>> > > >>>>>>>>>>>>>>>>> wrapper around > >>>>>>>> > > >>>>>>>>>>>>>>>>>> dropwizard meter maybe used instead of > >>>>>>>> MeterView). > >>>>>>>> > > >>>>>>>>>>>>>>>>>> 2. When connector specific metrics are > >>>>>>>> added, it is > >>>>>>>> > > >>>>>>>>>>>>>>>>> difficult to enforce > >>>>>>>> > > >>>>>>>>>>>>>>>>>> the scope to be the same as standard > metrics. > >>>>>>>> > > >>>>>>>>>>>>>>>>>> 3. It seems that a method proliferation is > >>>>>>>> inevitable > >>>>>>>> > if > >>>>>>>> > > >> we > >>>>>>>> > > >>>>>>>>>>>>>>>>> want to apply > >>>>>>>> > > >>>>>>>>>>>>>>>>>> sanity checks. e.g. The metric of > numBytesIn > >>>>>>>> was not > >>>>>>>> > > >>>>>>>>>>>>>>>>> registered for a meter. > >>>>>>>> > > >>>>>>>>>>>>>>>>>> 4. Metrics are still defined in random > >>>>>>>> places and hard > >>>>>>>> > > to > >>>>>>>> > > >>>>>>>>>>>>>>>> track. > >>>>>>>> > > >>>>>>>>>>>>>>>>>> > >>>>>>>> > > >>>>>>>>>>>>>>>>>> The current PR I had was inspired by the > >>>>>>>> Config system > >>>>>>>> > > in > >>>>>>>> > > >>>>>>>>>>>>>>>>> Kafka, which I > >>>>>>>> > > >>>>>>>>>>>>>>>>>> found pretty handy. In fact it is not only > >>>>>>>> used by > >>>>>>>> > Kafka > >>>>>>>> > > >>>>>>>>>>>>>>>>> itself but even > >>>>>>>> > > >>>>>>>>>>>>>>>>>> some other projects that depend on Kafka. I > >>>>>>>> am not > >>>>>>>> > > saying > >>>>>>>> > > >>>>>>>>>>>>>>>>> this approach is > >>>>>>>> > > >>>>>>>>>>>>>>>>>> perfect. But I think it worths to save the > >>>>>>>> work for > >>>>>>>> > > >>>>>>>>>>>>>>>>> connector writers and > >>>>>>>> > > >>>>>>>>>>>>>>>>>> encourage more systematic implementation. > >>>>>>>> That being > >>>>>>>> > > said, > >>>>>>>> > > >>>>>>>>>>>>>> I > >>>>>>>> > > >>>>>>>>>>>>>>>>> am fully open > >>>>>>>> > > >>>>>>>>>>>>>>>>>> to suggestions. > >>>>>>>> > > >>>>>>>>>>>>>>>>>> > >>>>>>>> > > >>>>>>>>>>>>>>>>>> > >>>>>>>> > > >>>>>>>>>>>>>>>>>> Re: Histogram > >>>>>>>> > > >>>>>>>>>>>>>>>>>> I think there are two orthogonal questions > >>>>>>>> around > >>>>>>>> > those > >>>>>>>> > > >>>>>>>>>>>>>>>> metrics: > >>>>>>>> > > >>>>>>>>>>>>>>>>>> > >>>>>>>> > > >>>>>>>>>>>>>>>>>> 1. Regardless of the metric type, by just > >>>>>>>> looking at > >>>>>>>> > the > >>>>>>>> > > >>>>>>>>>>>>>>>>> meaning of a > >>>>>>>> > > >>>>>>>>>>>>>>>>>> metric, is generic to all connectors? If > the > >>>>>>>> answer is > >>>>>>>> > > >> yes, > >>>>>>>> > > >>>>>>>>>>>>>>>>> we should > >>>>>>>> > > >>>>>>>>>>>>>>>>>> include the metric into the convention. No > >>>>>>>> matter > >>>>>>>> > > whether > >>>>>>>> > > >>>>>>>>>>>>>> we > >>>>>>>> > > >>>>>>>>>>>>>>>>> include it > >>>>>>>> > > >>>>>>>>>>>>>>>>>> into the convention or not, some connector > >>>>>>>> > > implementations > >>>>>>>> > > >>>>>>>>>>>>>>>>> will emit such > >>>>>>>> > > >>>>>>>>>>>>>>>>>> metric. It is better to have a convention > >>>>>>>> than letting > >>>>>>>> > > >> each > >>>>>>>> > > >>>>>>>>>>>>>>>>> connector do > >>>>>>>> > > >>>>>>>>>>>>>>>>>> random things. > >>>>>>>> > > >>>>>>>>>>>>>>>>>> > >>>>>>>> > > >>>>>>>>>>>>>>>>>> 2. If a standard metric is a histogram, > what > >>>>>>>> should we > >>>>>>>> > > do? > >>>>>>>> > > >>>>>>>>>>>>>>>>>> I agree that we should make it clear that > >>>>>>>> using > >>>>>>>> > > histograms > >>>>>>>> > > >>>>>>>>>>>>>>>>> will have > >>>>>>>> > > >>>>>>>>>>>>>>>>>> performance risk. But I do see histogram is > >>>>>>>> useful in > >>>>>>>> > > some > >>>>>>>> > > >>>>>>>>>>>>>>>>> fine-granularity > >>>>>>>> > > >>>>>>>>>>>>>>>>>> debugging where one do not have the luxury > >>>>>>>> to stop the > >>>>>>>> > > >>>>>>>>>>>>>>>>> system and inject > >>>>>>>> > > >>>>>>>>>>>>>>>>>> more inspection code. So the workaround I > am > >>>>>>>> thinking > >>>>>>>> > is > >>>>>>>> > > >> to > >>>>>>>> > > >>>>>>>>>>>>>>>>> provide some > >>>>>>>> > > >>>>>>>>>>>>>>>>>> implementation suggestions. Assume later on > >>>>>>>> we have a > >>>>>>>> > > >>>>>>>>>>>>>>>>> mechanism of > >>>>>>>> > > >>>>>>>>>>>>>>>>>> selective metrics. In the abstract metrics > >>>>>>>> class we > >>>>>>>> > can > >>>>>>>> > > >>>>>>>>>>>>>>>>> disable those > >>>>>>>> > > >>>>>>>>>>>>>>>>>> metrics by default individual connector > >>>>>>>> writers does > >>>>>>>> > not > >>>>>>>> > > >>>>>>>>>>>>>>>>> have to do > >>>>>>>> > > >>>>>>>>>>>>>>>>>> anything (this is another advantage of > >>>>>>>> having an > >>>>>>>> > > >>>>>>>>>>>>>>>>> AbstractMetrics instead of > >>>>>>>> > > >>>>>>>>>>>>>>>>>> static util methods.) > >>>>>>>> > > >>>>>>>>>>>>>>>>>> > >>>>>>>> > > >>>>>>>>>>>>>>>>>> I am not sure I fully understand the > >>>>>>>> histogram in the > >>>>>>>> > > >>>>>>>>>>>>>>>>> backend approach. Can > >>>>>>>> > > >>>>>>>>>>>>>>>>>> you explain a bit more? Do you mean > emitting > >>>>>>>> the raw > >>>>>>>> > > data, > >>>>>>>> > > >>>>>>>>>>>>>>>>> e.g. fetchTime > >>>>>>>> > > >>>>>>>>>>>>>>>>>> and emitTime with each record and let the > >>>>>>>> histogram > >>>>>>>> > > >>>>>>>>>>>>>>>>> computation happen in > >>>>>>>> > > >>>>>>>>>>>>>>>>>> the background? Or let the processing > thread > >>>>>>>> putting > >>>>>>>> > the > >>>>>>>> > > >>>>>>>>>>>>>>>>> values into a > >>>>>>>> > > >>>>>>>>>>>>>>>>>> queue and have a separate thread polling > >>>>>>>> from the > >>>>>>>> > queue > >>>>>>>> > > >> and > >>>>>>>> > > >>>>>>>>>>>>>>>>> add them into > >>>>>>>> > > >>>>>>>>>>>>>>>>>> the histogram? > >>>>>>>> > > >>>>>>>>>>>>>>>>>> > >>>>>>>> > > >>>>>>>>>>>>>>>>>> Thanks, > >>>>>>>> > > >>>>>>>>>>>>>>>>>> > >>>>>>>> > > >>>>>>>>>>>>>>>>>> Jiangjie (Becket) Qin > >>>>>>>> > > >>>>>>>>>>>>>>>>>> > >>>>>>>> > > >>>>>>>>>>>>>>>>>> > >>>>>>>> > > >>>>>>>>>>>>>>>>>> > >>>>>>>> > > >>>>>>>>>>>>>>>>>> > >>>>>>>> > > >>>>>>>>>>>>>>>>>> > >>>>>>>> > > >>>>>>>>>>>>>>>>>> On Fri, Feb 22, 2019 at 4:34 PM Chesnay > >>>>>>>> Schepler > >>>>>>>> > > >>>>>>>>>>>>>>>>> <ches...@apache.org <mailto: > >>>>>>>> ches...@apache.org>> > >>>>>>>> > > wrote: > >>>>>>>> > > >>>>>>>>>>>>>>>>>> > >>>>>>>> > > >>>>>>>>>>>>>>>>>>> Re: Flip > >>>>>>>> > > >>>>>>>>>>>>>>>>>>> The very first line under both the main > >>>>>>>> header and > >>>>>>>> > > >> Purpose > >>>>>>>> > > >>>>>>>>>>>>>>>>> section > >>>>>>>> > > >>>>>>>>>>>>>>>>>>> describe Flips as "major changes", which > >>>>>>>> this isn't. > >>>>>>>> > > >>>>>>>>>>>>>>>>>>> > >>>>>>>> > > >>>>>>>>>>>>>>>>>>> Re: complication > >>>>>>>> > > >>>>>>>>>>>>>>>>>>> I'm not arguing against standardization, > >>>>>>>> but again an > >>>>>>>> > > >>>>>>>>>>>>>>>>> over-complicated > >>>>>>>> > > >>>>>>>>>>>>>>>>>>> implementation when a static utility > method > >>>>>>>> would be > >>>>>>>> > > >>>>>>>>>>>>>>>>> sufficient. > >>>>>>>> > > >>>>>>>>>>>>>>>>>>> > >>>>>>>> > > >>>>>>>>>>>>>>>>>>> public static void setupConnectorMetrics( > >>>>>>>> > > >>>>>>>>>>>>>>>>>>> MetricGroup operatorMetricGroup, > >>>>>>>> > > >>>>>>>>>>>>>>>>>>> String connectorName, > >>>>>>>> > > >>>>>>>>>>>>>>>>>>> Optional<Gauge<Long>> numRecordsIn, > >>>>>>>> > > >>>>>>>>>>>>>>>>>>> ...) > >>>>>>>> > > >>>>>>>>>>>>>>>>>>> > >>>>>>>> > > >>>>>>>>>>>>>>>>>>> This gives you all you need: > >>>>>>>> > > >>>>>>>>>>>>>>>>>>> * a well-defined set of metrics for a > >>>>>>>> connector to > >>>>>>>> > > opt-in > >>>>>>>> > > >>>>>>>>>>>>>>>>>>> * standardized naming schemes for scope > and > >>>>>>>> > individual > >>>>>>>> > > >>>>>>>>>>>>>>> metrics > >>>>>>>> > > >>>>>>>>>>>>>>>>>>> * standardize metric types (although > >>>>>>>> personally I'm > >>>>>>>> > not > >>>>>>>> > > >>>>>>>>>>>>>>>>> interested in that > >>>>>>>> > > >>>>>>>>>>>>>>>>>>> since metric types should be considered > >>>>>>>> syntactic > >>>>>>>> > > sugar) > >>>>>>>> > > >>>>>>>>>>>>>>>>>>> > >>>>>>>> > > >>>>>>>>>>>>>>>>>>> Re: Configurable Histogram > >>>>>>>> > > >>>>>>>>>>>>>>>>>>> If anything they _must_ be turned off by > >>>>>>>> default, but > >>>>>>>> > > the > >>>>>>>> > > >>>>>>>>>>>>>>>>> metric system is > >>>>>>>> > > >>>>>>>>>>>>>>>>>>> already exposing so many options that I'm > >>>>>>>> not too > >>>>>>>> > keen > >>>>>>>> > > on > >>>>>>>> > > >>>>>>>>>>>>>>>>> adding even more. > >>>>>>>> > > >>>>>>>>>>>>>>>>>>> You have also only addressed my first > >>>>>>>> argument > >>>>>>>> > against > >>>>>>>> > > >>>>>>>>>>>>>>>>> histograms > >>>>>>>> > > >>>>>>>>>>>>>>>>>>> (performance), the second one still stands > >>>>>>>> (calculate > >>>>>>>> > > >>>>>>>>>>>>>>>>> histogram in metric > >>>>>>>> > > >>>>>>>>>>>>>>>>>>> backends instead). > >>>>>>>> > > >>>>>>>>>>>>>>>>>>> > >>>>>>>> > > >>>>>>>>>>>>>>>>>>> On 21.02.2019 16:27, Becket Qin wrote: > >>>>>>>> > > >>>>>>>>>>>>>>>>>>>> Hi Chesnay, > >>>>>>>> > > >>>>>>>>>>>>>>>>>>>> > >>>>>>>> > > >>>>>>>>>>>>>>>>>>>> Thanks for the comments. I think this is > >>>>>>>> worthy of a > >>>>>>>> > > >> FLIP > >>>>>>>> > > >>>>>>>>>>>>>>>>> because it is > >>>>>>>> > > >>>>>>>>>>>>>>>>>>>> public API. According to the FLIP > >>>>>>>> description a FlIP > >>>>>>>> > > is > >>>>>>>> > > >>>>>>>>>>>>>>>>> required in case > >>>>>>>> > > >>>>>>>>>>>>>>>>>>> of: > >>>>>>>> > > >>>>>>>>>>>>>>>>>>>> - Any change that impacts the public > >>>>>>>> interfaces of > >>>>>>>> > > >>>>>>>>>>>>>>>>> the project > >>>>>>>> > > >>>>>>>>>>>>>>>>>>>> > >>>>>>>> > > >>>>>>>>>>>>>>>>>>>> and the following entry is found in the > >>>>>>>> definition > >>>>>>>> > of > >>>>>>>> > > >>>>>>>>>>>>>>>>> "public interface". > >>>>>>>> > > >>>>>>>>>>>>>>>>>>>> > >>>>>>>> > > >>>>>>>>>>>>>>>>>>>> - Exposed monitoring information > >>>>>>>> > > >>>>>>>>>>>>>>>>>>>> > >>>>>>>> > > >>>>>>>>>>>>>>>>>>>> Metrics are critical to any production > >>>>>>>> system. So a > >>>>>>>> > > >> clear > >>>>>>>> > > >>>>>>>>>>>>>>>>> metric > >>>>>>>> > > >>>>>>>>>>>>>>>>>>> definition > >>>>>>>> > > >>>>>>>>>>>>>>>>>>>> is important for any serious users. For > an > >>>>>>>> > > organization > >>>>>>>> > > >>>>>>>>>>>>>>>>> with large Flink > >>>>>>>> > > >>>>>>>>>>>>>>>>>>>> installation, change in metrics means > >>>>>>>> great amount > >>>>>>>> > of > >>>>>>>> > > >>>>>>>>>>>>>>>>> work. So such > >>>>>>>> > > >>>>>>>>>>>>>>>>>>> changes > >>>>>>>> > > >>>>>>>>>>>>>>>>>>>> do need to be fully discussed and > >>>>>>>> documented. > >>>>>>>> > > >>>>>>>>>>>>>>>>>>>> > >>>>>>>> > > >>>>>>>>>>>>>>>>>>>> ** Re: Histogram. > >>>>>>>> > > >>>>>>>>>>>>>>>>>>>> We can discuss whether there is a better > >>>>>>>> way to > >>>>>>>> > expose > >>>>>>>> > > >>>>>>>>>>>>>>>>> metrics that are > >>>>>>>> > > >>>>>>>>>>>>>>>>>>>> suitable for histograms. My > >>>>>>>> micro-benchmark on > >>>>>>>> > various > >>>>>>>> > > >>>>>>>>>>>>>>>>> histogram > >>>>>>>> > > >>>>>>>>>>>>>>>>>>>> implementations also indicates that they > >>>>>>>> are > >>>>>>>> > > >>>>>>>>>>>>>> significantly > >>>>>>>> > > >>>>>>>>>>>>>>>>> slower than > >>>>>>>> > > >>>>>>>>>>>>>>>>>>>> other metric types. But I don't think > that > >>>>>>>> means > >>>>>>>> > never > >>>>>>>> > > >>>>>>>>>>>>>> use > >>>>>>>> > > >>>>>>>>>>>>>>>>> histogram, but > >>>>>>>> > > >>>>>>>>>>>>>>>>>>>> means use it with caution. For example, > we > >>>>>>>> can > >>>>>>>> > suggest > >>>>>>>> > > >>>>>>>>>>>>>> the > >>>>>>>> > > >>>>>>>>>>>>>>>>>>> implementations > >>>>>>>> > > >>>>>>>>>>>>>>>>>>>> to turn them off by default and only turn > >>>>>>>> it on for > >>>>>>>> > a > >>>>>>>> > > >>>>>>>>>>>>>>>>> small amount of > >>>>>>>> > > >>>>>>>>>>>>>>>>>>> time > >>>>>>>> > > >>>>>>>>>>>>>>>>>>>> when performing some micro-debugging. > >>>>>>>> > > >>>>>>>>>>>>>>>>>>>> > >>>>>>>> > > >>>>>>>>>>>>>>>>>>>> ** Re: complication: > >>>>>>>> > > >>>>>>>>>>>>>>>>>>>> Connector conventions are essential for > >>>>>>>> Flink > >>>>>>>> > > ecosystem. > >>>>>>>> > > >>>>>>>>>>>>>>>>> Flink connectors > >>>>>>>> > > >>>>>>>>>>>>>>>>>>>> pool is probably the most important part > >>>>>>>> of Flink, > >>>>>>>> > > just > >>>>>>>> > > >>>>>>>>>>>>>>>>> like any other > >>>>>>>> > > >>>>>>>>>>>>>>>>>>> data > >>>>>>>> > > >>>>>>>>>>>>>>>>>>>> system. Clear conventions of connectors > >>>>>>>> will help > >>>>>>>> > > build > >>>>>>>> > > >>>>>>>>>>>>>>>>> Flink ecosystem > >>>>>>>> > > >>>>>>>>>>>>>>>>>>> in > >>>>>>>> > > >>>>>>>>>>>>>>>>>>>> a more organic way. > >>>>>>>> > > >>>>>>>>>>>>>>>>>>>> Take the metrics convention as an > example, > >>>>>>>> imagine > >>>>>>>> > > >>>>>>>>>>>>>> someone > >>>>>>>> > > >>>>>>>>>>>>>>>>> has developed > >>>>>>>> > > >>>>>>>>>>>>>>>>>>> a > >>>>>>>> > > >>>>>>>>>>>>>>>>>>>> Flink connector for System foo, and > another > >>>>>>>> > developer > >>>>>>>> > > >> may > >>>>>>>> > > >>>>>>>>>>>>>>>>> have developed > >>>>>>>> > > >>>>>>>>>>>>>>>>>>> a > >>>>>>>> > > >>>>>>>>>>>>>>>>>>>> monitoring and diagnostic framework for > >>>>>>>> Flink which > >>>>>>>> > > >>>>>>>>>>>>>>>>> analyzes the Flink > >>>>>>>> > > >>>>>>>>>>>>>>>>>>> job > >>>>>>>> > > >>>>>>>>>>>>>>>>>>>> performance based on metrics. With a > clear > >>>>>>>> metric > >>>>>>>> > > >>>>>>>>>>>>>>>>> convention, those two > >>>>>>>> > > >>>>>>>>>>>>>>>>>>>> projects could be developed > independently. > >>>>>>>> Once > >>>>>>>> > users > >>>>>>>> > > >> put > >>>>>>>> > > >>>>>>>>>>>>>>>>> them together, > >>>>>>>> > > >>>>>>>>>>>>>>>>>>>> it would work without additional > >>>>>>>> modifications. This > >>>>>>>> > > >>>>>>>>>>>>>>>>> cannot be easily > >>>>>>>> > > >>>>>>>>>>>>>>>>>>>> achieved by just defining a few > constants. > >>>>>>>> > > >>>>>>>>>>>>>>>>>>>> > >>>>>>>> > > >>>>>>>>>>>>>>>>>>>> ** Re: selective metrics: > >>>>>>>> > > >>>>>>>>>>>>>>>>>>>> Sure, we can discuss that in a separate > >>>>>>>> thread. > >>>>>>>> > > >>>>>>>>>>>>>>>>>>>> > >>>>>>>> > > >>>>>>>>>>>>>>>>>>>> @Dawid > >>>>>>>> > > >>>>>>>>>>>>>>>>>>>> > >>>>>>>> > > >>>>>>>>>>>>>>>>>>>> ** Re: latency / fetchedLatency > >>>>>>>> > > >>>>>>>>>>>>>>>>>>>> The primary purpose of establish such a > >>>>>>>> convention > >>>>>>>> > is > >>>>>>>> > > to > >>>>>>>> > > >>>>>>>>>>>>>>>>> help developers > >>>>>>>> > > >>>>>>>>>>>>>>>>>>>> write connectors in a more compatible > way. > >>>>>>>> The > >>>>>>>> > > >> convention > >>>>>>>> > > >>>>>>>>>>>>>>>>> is supposed to > >>>>>>>> > > >>>>>>>>>>>>>>>>>>> be > >>>>>>>> > > >>>>>>>>>>>>>>>>>>>> defined more proactively. So when look at > >>>>>>>> the > >>>>>>>> > > >> convention, > >>>>>>>> > > >>>>>>>>>>>>>>>>> it seems more > >>>>>>>> > > >>>>>>>>>>>>>>>>>>>> important to see if the concept is > >>>>>>>> applicable to > >>>>>>>> > > >>>>>>>>>>>>>>>>> connectors in general. > >>>>>>>> > > >>>>>>>>>>>>>>>>>>> It > >>>>>>>> > > >>>>>>>>>>>>>>>>>>>> might be true so far only Kafka connector > >>>>>>>> reports > >>>>>>>> > > >>>>>>>>>>>>>> latency. > >>>>>>>> > > >>>>>>>>>>>>>>>>> But there > >>>>>>>> > > >>>>>>>>>>>>>>>>>>> might > >>>>>>>> > > >>>>>>>>>>>>>>>>>>>> be hundreds of other connector > >>>>>>>> implementations in > >>>>>>>> > the > >>>>>>>> > > >>>>>>>>>>>>>>>>> Flink ecosystem, > >>>>>>>> > > >>>>>>>>>>>>>>>>>>>> though not in the Flink repo, and some of > >>>>>>>> them also > >>>>>>>> > > >> emits > >>>>>>>> > > >>>>>>>>>>>>>>>>> latency. I > >>>>>>>> > > >>>>>>>>>>>>>>>>>>> think > >>>>>>>> > > >>>>>>>>>>>>>>>>>>>> a lot of other sources actually also has > >>>>>>>> an append > >>>>>>>> > > >>>>>>>>>>>>>>>>> timestamp. e.g. > >>>>>>>> > > >>>>>>>>>>>>>>>>>>> database > >>>>>>>> > > >>>>>>>>>>>>>>>>>>>> bin logs and some K-V stores. So I > >>>>>>>> wouldn't be > >>>>>>>> > > surprised > >>>>>>>> > > >>>>>>>>>>>>>>>>> if some database > >>>>>>>> > > >>>>>>>>>>>>>>>>>>>> connector can also emit latency metrics. > >>>>>>>> > > >>>>>>>>>>>>>>>>>>>> > >>>>>>>> > > >>>>>>>>>>>>>>>>>>>> Thanks, > >>>>>>>> > > >>>>>>>>>>>>>>>>>>>> > >>>>>>>> > > >>>>>>>>>>>>>>>>>>>> Jiangjie (Becket) Qin > >>>>>>>> > > >>>>>>>>>>>>>>>>>>>> > >>>>>>>> > > >>>>>>>>>>>>>>>>>>>> > >>>>>>>> > > >>>>>>>>>>>>>>>>>>>> On Thu, Feb 21, 2019 at 10:14 PM Chesnay > >>>>>>>> Schepler > >>>>>>>> > > >>>>>>>>>>>>>>>>> <ches...@apache.org <mailto: > >>>>>>>> ches...@apache.org>> > >>>>>>>> > > >>>>>>>>>>>>>>>>>>>> wrote: > >>>>>>>> > > >>>>>>>>>>>>>>>>>>>> > >>>>>>>> > > >>>>>>>>>>>>>>>>>>>>> Regarding 2) It doesn't make sense to > >>>>>>>> investigate > >>>>>>>> > > this > >>>>>>>> > > >>>>>>>>>>>>>> as > >>>>>>>> > > >>>>>>>>>>>>>>>>> part of this > >>>>>>>> > > >>>>>>>>>>>>>>>>>>>>> FLIP. This is something that could be of > >>>>>>>> interest > >>>>>>>> > for > >>>>>>>> > > >>>>>>>>>>>>>> the > >>>>>>>> > > >>>>>>>>>>>>>>>>> entire metric > >>>>>>>> > > >>>>>>>>>>>>>>>>>>>>> system, and should be designed for as > >>>>>>>> such. > >>>>>>>> > > >>>>>>>>>>>>>>>>>>>>> > >>>>>>>> > > >>>>>>>>>>>>>>>>>>>>> Regarding the proposal as a whole: > >>>>>>>> > > >>>>>>>>>>>>>>>>>>>>> > >>>>>>>> > > >>>>>>>>>>>>>>>>>>>>> Histogram metrics shall not be added to > >>>>>>>> the core of > >>>>>>>> > > >>>>>>>>>>>>>>>>> Flink. They are > >>>>>>>> > > >>>>>>>>>>>>>>>>>>>>> significantly more expensive than other > >>>>>>>> metrics, > >>>>>>>> > and > >>>>>>>> > > >>>>>>>>>>>>>>>>> calculating > >>>>>>>> > > >>>>>>>>>>>>>>>>>>>>> histograms in the application is > regarded > >>>>>>>> as an > >>>>>>>> > > >>>>>>>>>>>>>>>>> anti-pattern by several > >>>>>>>> > > >>>>>>>>>>>>>>>>>>>>> metric backends, who instead recommend > to > >>>>>>>> expose > >>>>>>>> > the > >>>>>>>> > > >> raw > >>>>>>>> > > >>>>>>>>>>>>>>>>> data and > >>>>>>>> > > >>>>>>>>>>>>>>>>>>>>> calculate the histogram in the backend. > >>>>>>>> > > >>>>>>>>>>>>>>>>>>>>> > >>>>>>>> > > >>>>>>>>>>>>>>>>>>>>> Second, this seems overly complicated. > >>>>>>>> Given that > >>>>>>>> > we > >>>>>>>> > > >>>>>>>>>>>>>>>>> already established > >>>>>>>> > > >>>>>>>>>>>>>>>>>>>>> that not all connectors will export all > >>>>>>>> metrics we > >>>>>>>> > > are > >>>>>>>> > > >>>>>>>>>>>>>>>>> effectively > >>>>>>>> > > >>>>>>>>>>>>>>>>>>>>> reducing this down to a consistent > naming > >>>>>>>> scheme. > >>>>>>>> > We > >>>>>>>> > > >>>>>>>>>>>>>>>>> don't need anything > >>>>>>>> > > >>>>>>>>>>>>>>>>>>>>> sophisticated for that; basically just a > >>>>>>>> few > >>>>>>>> > > constants > >>>>>>>> > > >>>>>>>>>>>>>>>>> that all > >>>>>>>> > > >>>>>>>>>>>>>>>>>>>>> connectors use. > >>>>>>>> > > >>>>>>>>>>>>>>>>>>>>> > >>>>>>>> > > >>>>>>>>>>>>>>>>>>>>> I'm not convinced that this is worthy of > >>>>>>>> a FLIP. > >>>>>>>> > > >>>>>>>>>>>>>>>>>>>>> > >>>>>>>> > > >>>>>>>>>>>>>>>>>>>>> On 21.02.2019 14:26, Dawid Wysakowicz > >>>>>>>> wrote: > >>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>> Hi, > >>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>> > >>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>> Ad 1. In general I undestand and I > >>>>>>>> agree. But > >>>>>>>> > those > >>>>>>>> > > >>>>>>>>>>>>>>>>> particular metrics > >>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>> (latency, fetchLatency), right now > would > >>>>>>>> only be > >>>>>>>> > > >>>>>>>>>>>>>>>>> reported if user uses > >>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>> KafkaConsumer with internal > >>>>>>>> timestampAssigner with > >>>>>>>> > > >>>>>>>>>>>>>>>>> StreamCharacteristic > >>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>> set to EventTime, right? That sounds > >>>>>>>> like a very > >>>>>>>> > > >>>>>>>>>>>>>>>>> specific case. I am > >>>>>>>> > > >>>>>>>>>>>>>>>>>>> not > >>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>> sure if we should introduce a generic > >>>>>>>> metric that > >>>>>>>> > > will > >>>>>>>> > > >>>>>>>>>>>>>> be > >>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>> disabled/absent for most of > >>>>>>>> implementations. > >>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>> > >>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>> Ad.2 That sounds like an orthogonal > >>>>>>>> issue, that > >>>>>>>> > > might > >>>>>>>> > > >>>>>>>>>>>>>>>>> make sense to > >>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>> investigate in the future. > >>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>> > >>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>> Best, > >>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>> > >>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>> Dawid > >>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>> > >>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>> On 21/02/2019 13:20, Becket Qin wrote: > >>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>>> Hi Dawid, > >>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>>> > >>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>>> Thanks for the feedback. That makes > >>>>>>>> sense to me. > >>>>>>>> > > >> There > >>>>>>>> > > >>>>>>>>>>>>>>>>> are two cases > >>>>>>>> > > >>>>>>>>>>>>>>>>>>> to > >>>>>>>> > > >>>>>>>>>>>>>>>>>>>>> be > >>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>>> addressed. > >>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>>> > >>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>>> 1. The metrics are supposed to be a > >>>>>>>> guidance. It > >>>>>>>> > is > >>>>>>>> > > >>>>>>>>>>>>>>>>> likely that a > >>>>>>>> > > >>>>>>>>>>>>>>>>>>>>> connector > >>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>>> only supports some but not all of the > >>>>>>>> metrics. In > >>>>>>>> > > >> that > >>>>>>>> > > >>>>>>>>>>>>>>>>> case, each > >>>>>>>> > > >>>>>>>>>>>>>>>>>>>>> connector > >>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>>> implementation should have the freedom > >>>>>>>> to decide > >>>>>>>> > > >> which > >>>>>>>> > > >>>>>>>>>>>>>>>>> metrics are > >>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>>> reported. For the metrics that are > >>>>>>>> supported, the > >>>>>>>> > > >>>>>>>>>>>>>>>>> guidance should be > >>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>>> followed. > >>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>>> > >>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>>> 2. Sometimes users may want to disable > >>>>>>>> certain > >>>>>>>> > > >> metrics > >>>>>>>> > > >>>>>>>>>>>>>>>>> for some reason > >>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>>> (e.g. performance / reprocessing of > >>>>>>>> data). A > >>>>>>>> > > generic > >>>>>>>> > > >>>>>>>>>>>>>>>>> mechanism should > >>>>>>>> > > >>>>>>>>>>>>>>>>>>> be > >>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>>> provided to allow user choose which > >>>>>>>> metrics are > >>>>>>>> > > >>>>>>>>>>>>>>>>> reported. This > >>>>>>>> > > >>>>>>>>>>>>>>>>>>> mechanism > >>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>>> should also be honored by the > connector > >>>>>>>> > > >>>>>>>>>>>>>> implementations. > >>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>>> > >>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>>> Does this sound reasonable to you? > >>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>>> > >>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>>> Thanks, > >>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>>> > >>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>>> Jiangjie (Becket) Qin > >>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>>> > >>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>>> > >>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>>> > >>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>>> On Thu, Feb 21, 2019 at 4:22 PM Dawid > >>>>>>>> Wysakowicz > >>>>>>>> > < > >>>>>>>> > > >>>>>>>>>>>>>>>>>>>>> dwysakow...@apache.org <mailto: > >>>>>>>> > > dwysakow...@apache.org > >>>>>>>> > > >>>> > >>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>>> wrote: > >>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>>> > >>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>>>> Hi, > >>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>>>> > >>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>>>> Generally I like the idea of having a > >>>>>>>> unified, > >>>>>>>> > > >>>>>>>>>>>>>>>>> standard set of > >>>>>>>> > > >>>>>>>>>>>>>>>>>>> metrics > >>>>>>>> > > >>>>>>>>>>>>>>>>>>>>> for > >>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>>>> all connectors. I have some slight > >>>>>>>> concerns > >>>>>>>> > about > >>>>>>>> > > >>>>>>>>>>>>>>>>> fetchLatency and > >>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>>>> latency though. They are computed > >>>>>>>> based on > >>>>>>>> > > EventTime > >>>>>>>> > > >>>>>>>>>>>>>>>>> which is not a > >>>>>>>> > > >>>>>>>>>>>>>>>>>>>>> purely > >>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>>>> technical feature. It depends often > on > >>>>>>>> some > >>>>>>>> > > business > >>>>>>>> > > >>>>>>>>>>>>>>>>> logic, might be > >>>>>>>> > > >>>>>>>>>>>>>>>>>>>>> absent > >>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>>>> or defined after source. Those > metrics > >>>>>>>> could > >>>>>>>> > also > >>>>>>>> > > >>>>>>>>>>>>>>>>> behave in a weird > >>>>>>>> > > >>>>>>>>>>>>>>>>>>>>> way in > >>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>>>> case of replaying backlog. Therefore > I > >>>>>>>> am not > >>>>>>>> > sure > >>>>>>>> > > >> if > >>>>>>>> > > >>>>>>>>>>>>>>>>> we should > >>>>>>>> > > >>>>>>>>>>>>>>>>>>> include > >>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>>>> those metrics by default. Maybe we > >>>>>>>> could at > >>>>>>>> > least > >>>>>>>> > > >>>>>>>>>>>>>>>>> introduce a feature > >>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>>>> switch for them? What do you think? > >>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>>>> > >>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>>>> Best, > >>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>>>> > >>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>>>> Dawid > >>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>>>> On 21/02/2019 03:13, Becket Qin > wrote: > >>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>>>> > >>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>>>> Bump. If there is no objections to > the > >>>>>>>> proposed > >>>>>>>> > > >>>>>>>>>>>>>>>>> metrics. I'll start a > >>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>>>> voting thread later toady. > >>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>>>> > >>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>>>> Thanks, > >>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>>>> > >>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>>>> Jiangjie (Becket) Qin > >>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>>>> > >>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>>>> On Mon, Feb 11, 2019 at 8:17 PM > Becket > >>>>>>>> Qin > >>>>>>>> > > >>>>>>>>>>>>>>>>> <becket....@gmail.com <mailto: > >>>>>>>> becket....@gmail.com>> > >>>>>>>> > < > >>>>>>>> > > >>>>>>>>>>>>>>>>>>>>> becket....@gmail.com <mailto: > >>>>>>>> becket....@gmail.com > >>>>>>>> > >> > >>>>>>>> > > >>>>>>>>>>>>>>> wrote: > >>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>>>> Hi folks, > >>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>>>> > >>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>>>> I would like to start the FLIP > >>>>>>>> discussion thread > >>>>>>>> > > >>>>>>>>>>>>>> about > >>>>>>>> > > >>>>>>>>>>>>>>>>> standardize > >>>>>>>> > > >>>>>>>>>>>>>>>>>>> the > >>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>>>> connector metrics. > >>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>>>> > >>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>>>> In short, we would like to provide a > >>>>>>>> convention > >>>>>>>> > of > >>>>>>>> > > >>>>>>>>>>>>>>>>> Flink connector > >>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>>>> metrics. It will help simplify the > >>>>>>>> monitoring > >>>>>>>> > and > >>>>>>>> > > >>>>>>>>>>>>>>>>> alerting on Flink > >>>>>>>> > > >>>>>>>>>>>>>>>>>>>>> jobs. > >>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>>>> The FLIP link is following: > >>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>>>> > >>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>>>> > >>>>>>>> > > >>>>>>>>>>>>>>>>>>> > >>>>>>>> > > >>>>>>>>>>>>>>>>> > >>>>>>>> > > >>>>>>>>>>>>>>>> > >>>>>>>> > > >>>>>>>>>>>>>>> > >>>>>>>> > > >>>>>>>>>>>>>> > >>>>>>>> > > >>>>>>>>>>>> > >>>>>>>> > > >>>>>>>>> > >>>>>>>> > > >>>>>>> > >>>>>>>> > > >>>>> > >>>>>>>> > > >> > >>>>>>>> > > > >>>>>>>> > > >>>>>>>> > https://cwiki.apache.org/confluence/display/FLINK/FLIP-33%3A+Standardize+Connector+Metrics > >>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>>>> Thanks, > >>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>>>> > >>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>>>> Jiangjie (Becket) Qin > >>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>>>> > >>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>>>> > >>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>>>> > >>>>>>>> > > >>>>>>>>>>>>>>>>>>> > >>>>>>>> > > >>>>>>>>>>>>>>>>> > >>>>>>>> > > >>>>>>>>>>>>>>>> > >>>>>>>> > > >>>>>>>>>>>>>>>> > >>>>>>>> > > >>>>>>>>>>>>>>> > >>>>>>>> > > >>>>>>>>>>>>>> > >>>>>>>> > > >>>>>>>>>>>> > >>>>>>>> > > >>>>>>>>>>>> > >>>>>>>> > > >>>>>>>>> > >>>>>>>> > > >>>>>>>>> > >>>>>>>> > > >>>>>>>>> > >>>>>>>> > > >>>>>>> > >>>>>>>> > > >>>>>>> > >>>>>>>> > > >>>>>>> > >>>>>>>> > > >>>>> > >>>>>>>> > > >>>>> > >>>>>>>> > > >> > >>>>>>>> > > >> > >>>>>>>> > > > >>>>>>>> > > > >>>>>>>> > > >>>>>>>> > >>>>>>>> > >>>>>>>> -- > >>>>>>>> > >>>>>>>> Konstantin Knauf > >>>>>>>> > >>>>>>>> https://twitter.com/snntrable > >>>>>>>> > >>>>>>>> https://github.com/knaufk > >>>>>>>> > >>>>>>> > -- Konstantin Knauf https://twitter.com/snntrable https://github.com/knaufk