On Tue, Sep 8, 2020 at 6:55 PM Konstantin Knauf <kna...@apache.org> wrote:
> Hi Becket, > > Thank you for picking up this FLIP. I have a few questions: > > * two thoughts on naming: > * idleTime: In the meantime, a similar metric "idleTimeMsPerSecond" has > been introduced in https://issues.apache.org/jira/browse/FLINK-16864. They > have a similar name, but different definitions of idleness, > e.g. "idleTimeMsPerSecond" considers the SourceTask idle, when it is > backpressured. Can we make it clearer that these two metrics mean different > things? > > * "current(Fetch)Latency" I am wondering if "eventTimeLag(Before|After)" > is more descriptive/clear. What do others think? > > * Current(Fetch)Latency implies that the timestamps are directly > extracted in the source connector, right? Will this be the default for > FLIP-27 sources anyway? > > * Does FLIP-33 also include the implementation of these metrics (to the > extent possible) for all connectors currently available in Apache Flink or > is the "per-connector implementation" out-of-scope? > > Thanks, > > Konstantin > > > > > > On Fri, Sep 4, 2020 at 4:56 PM Becket Qin <becket....@gmail.com> wrote: > > > Hi all, > > > > To complete the Source refactoring work, I'd like to revive this > > discussion. Since the mail thread has been dormant for more than a year, > > just to recap the motivation of the FLIP: > > > > 1. The FLIP proposes to standardize the connector metrics by giving > > guidance on the metric specifications, including the name, type and > meaning > > of the metrics. > > 2. It is OK for a connector to not emit some of the metrics in the metric > > guidance, but if a metric of the same semantic is emitted, the > > implementation should follow the guidance. > > 3. It is OK for a connector to emit more metrics than what are listed in > > the FLIP. This includes having an alias for a metric specified in the > FLIP. > > 4. We will implement some of the metrics out of the box in the default > > implementation of FLIP-27, as long as it is applicable. 
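As an illustration of points 2 and 3 of the recap (a standard-named metric is optional, but an alias must report the same semantic value), a connector-side registry might look like the following sketch. The names `pendingRecords`/`pendingBytes` come from the FLIP discussion later in this thread; the registry class and the Kafka-style alias are hypothetical and are not Flink's actual metrics API.

```python
# Sketch of the FLIP-33 naming idea: a connector registers gauges under
# standardized names, optionally with connector-specific aliases that
# share the same underlying value. Hypothetical helper, not Flink API.

class ConnectorMetrics:
    # Standard names discussed in the thread; a connector may omit any of them.
    STANDARD = {"pendingRecords", "pendingBytes", "currentFetchLatency"}

    def __init__(self):
        self._gauges = {}

    def register(self, standard_name, supplier, alias=None):
        if standard_name not in self.STANDARD:
            raise ValueError(f"not a standardized metric: {standard_name}")
        self._gauges[standard_name] = supplier
        if alias is not None:
            # Point 3: an alias is allowed, but it must report the same value.
            self._gauges[alias] = supplier

    def value(self, name):
        return self._gauges[name]()


metrics = ConnectorMetrics()
backlog = {"count": 42}
metrics.register("pendingRecords", lambda: backlog["count"],
                 alias="records-lag-max")  # hypothetical Kafka-style alias
```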
> > > > The FLIP wiki is following: > > https://cwiki.apache.org/confluence/display/FLINK/FLIP-33 > > %3A+Standardize+Connector+Metrics > > > > Any thoughts? > > > > Thanks, > > > > Jiangjie (Becket) Qin > > > > > > On Fri, Jun 14, 2019 at 2:29 PM Piotr Nowojski <pi...@ververica.com> > > wrote: > > > > > > we will need to revisit the convention list and adjust them > accordingly > > > when FLIP-27 is ready > > > > > > > > > Yes, this sounds good :) > > > > > > Piotrek > > > > > > > On 13 Jun 2019, at 13:35, Becket Qin <becket....@gmail.com> wrote: > > > > > > > > Hi Piotr, > > > > > > > > That's great to know. Chances are that we will need to revisit the > > > > convention list and adjust them accordingly when FLIP-27 is ready, At > > > that > > > > point we can mark some of the metrics as available by default for > > > > connectors implementing the new interface. > > > > > > > > Thanks, > > > > > > > > Jiangjie (Becket) Qin > > > > > > > > On Thu, Jun 13, 2019 at 3:51 PM Piotr Nowojski <pi...@ververica.com> > > > wrote: > > > > > > > >> Thanks for driving this. I’ve just noticed one small thing. With new > > > >> SourceReader interface Flink will be able to provide `idleTime` > metric > > > >> automatically. > > > >> > > > >> Piotrek > > > >> > > > >>> On 13 Jun 2019, at 03:30, Becket Qin <becket....@gmail.com> wrote: > > > >>> > > > >>> Thanks all for the feedback and discussion. > > > >>> > > > >>> Since there wasn't any concern raised, I've started the voting > thread > > > for > > > >>> this FLIP, but please feel free to continue the discussion here if > > you > > > >>> think something still needs to be addressed. > > > >>> > > > >>> Thanks, > > > >>> > > > >>> Jiangjie (Becket) Qin > > > >>> > > > >>> > > > >>> > > > >>> On Mon, Jun 10, 2019 at 9:10 AM Becket Qin <becket....@gmail.com> > > > wrote: > > > >>> > > > >>>> Hi Piotr, > > > >>>> > > > >>>> Thanks for the comments. Yes, you are right. 
Users will have to > look > > > at > > > >>>> other metrics to decide whether the pipeline is healthy or not in > > the > > > >> first > > > >>>> place before they can use the time-based metric to fix the > > bottleneck. > > > >>>> > > > >>>> I agree that once we have FLIP-27 ready, some of the metrics can > > just > > > be > > > >>>> reported by the abstract implementation. > > > >>>> > > > >>>> I've updated FLIP-33 wiki page to add the pendingBytes and > > > >> pendingRecords > > > >>>> metric. Please let me know if you have any concern over the > updated > > > >> metric > > > >>>> convention proposal. > > > >>>> > > > >>>> @Chesnay Schepler <ches...@apache.org> @Stephan Ewen > > > >>>> <step...@ververica.com> will you also have time to take a look at > > the > > > >>>> proposed metric convention? If there is no further concern I'll > > start > > > a > > > >>>> voting thread for this FLIP in two days. > > > >>>> > > > >>>> Thanks, > > > >>>> > > > >>>> Jiangjie (Becket) Qin > > > >>>> > > > >>>> > > > >>>> > > > >>>> On Wed, Jun 5, 2019 at 6:54 PM Piotr Nowojski < > pi...@ververica.com> > > > >> wrote: > > > >>>> > > > >>>>> Hi Becket, > > > >>>>> > > > >>>>> Thanks for the answer :) > > > >>>>> > > > >>>>>> By time-based metric, I meant the portion of time spent on > > producing > > > >> the > > > >>>>>> record to downstream. For example, a source connector can report > > > that > > > >>>>> it's > > > >>>>>> spending 80% of time to emit record to downstream processing > > > pipeline. > > > >>>>> In > > > >>>>>> another case, a sink connector may report that its spending 30% > of > > > >> time > > > >>>>>> producing the records to the external system. > > > >>>>>> > > > >>>>>> This is in some sense equivalent to the buffer usage metric: > > > >>>>> > > > >>>>>> - 80% of time spent on emitting records to downstream ---> > > > downstream > > > >>>>>> node is bottleneck ---> output buffer is probably full. 
> > > >>>>>> - 30% of time spent on emitting records to downstream ---> > > > downstream > > > >>>>>> node is not bottleneck ---> output buffer is probably not full. > > > >>>>> > > > >>>>> If by “time spent on emitting records to downstream” you > understand > > > >>>>> “waiting on back pressure”, then I see your point. And I agree > that > > > >> some > > > >>>>> kind of ratio/time based metric gives you more information. > However > > > >> under > > > >>>>> “time spent on emitting records to downstream” might be hidden > the > > > >>>>> following (extreme) situation: > > > >>>>> > > > >>>>> 1. Job is barely able to handle influx of records, there is 99% > > > >>>>> CPU/resource usage in the cluster, but nobody is > > > >>>>> bottlenecked/backpressured, all output buffers are empty, > everybody > > > is > > > >>>>> waiting in 1% of it’s time for more records to process. > > > >>>>> 2. 80% time can still be spent on "down stream operators”, > because > > > they > > > >>>>> are the CPU intensive operations, but this doesn’t mean that > > > >> increasing the > > > >>>>> parallelism down the stream will help with anything there. To the > > > >> contrary, > > > >>>>> increasing parallelism of the source operator might help to > > increase > > > >>>>> resource utilisation up to 100%. > > > >>>>> > > > >>>>> However, this “time based/ratio” approach can be extended to > > > in/output > > > >>>>> buffer usage. Besides collecting an information that input/output > > > >> buffer is > > > >>>>> full/empty, we can probe profile how often are buffer empty/full. > > If > > > >> output > > > >>>>> buffer is full 1% of times, there is almost no back pressure. If > > it’s > > > >> full > > > >>>>> 80% of times, there is some back pressure, if it’s full 99.9% of > > > times, > > > >>>>> there is huge back pressure. > > > >>>>> > > > >>>>> Now for autoscaling you could compare the input & output buffers > > fill > > > >>>>> ratio: > > > >>>>> > > > >>>>> 1. 
Both are high: the source of the bottleneck is down the stream
> > >>>>> 2. Output is low, input is high: this operator/task is the bottleneck, and the higher the difference, the bigger the bottleneck it is
> > >>>>> 3. Output is high, input is low: there was some load spike that we are currently finishing processing
> > >>>>>
> > >>>>> But long story short, we are probably diverging from the topic of this discussion, and we can discuss this at some later point.
> > >>>>>
> > >>>>> For now, for sources, as I wrote before, +1 for:
> > >>>>> - pending.bytes, Gauge
> > >>>>> - pending.messages, Gauge
> > >>>>>
> > >>>>> When we develop/discuss the SourceReader from FLIP-27 we might then add:
> > >>>>>
> > >>>>> - in-memory.buffer.usage (0 - 100%)
> > >>>>>
> > >>>>> which would be estimated automatically by Flink, while the user would be able to override it / provide a better estimate.
> > >>>>>
> > >>>>> Piotrek
> > >>>>>
> > >>>>>> On 5 Jun 2019, at 05:42, Becket Qin <becket....@gmail.com> wrote:
> > >>>>>>
> > >>>>>> Hi Piotr,
> > >>>>>>
> > >>>>>> Thanks for the explanation. Please see some clarifications below.
> > >>>>>>
> > >>>>>> By time-based metric, I meant the portion of time spent on producing the record to downstream. For example, a source connector can report that it's spending 80% of its time emitting records to the downstream processing pipeline. In another case, a sink connector may report that it's spending 30% of its time producing records to the external system.
> > > >>>>>> > > > >>>>>> This is in some sense equivalent to the buffer usage metric: > > > >>>>>> - 80% of time spent on emitting records to downstream ---> > > > downstream > > > >>>>>> node is bottleneck ---> output buffer is probably full. > > > >>>>>> - 30% of time spent on emitting records to downstream ---> > > > downstream > > > >>>>>> node is not bottleneck ---> output buffer is probably not full. > > > >>>>>> > > > >>>>>> However, the time-based metric has a few advantages that the > > buffer > > > >>>>> usage > > > >>>>>> metric may not have. > > > >>>>>> > > > >>>>>> 1. Buffer usage metric may not be applicable to all the > connector > > > >>>>>> implementations, while reporting time-based metric are always > > > doable. > > > >>>>>> Some source connectors may not have any input buffer, or they > may > > > use > > > >>>>> some > > > >>>>>> third party library that does not expose the input buffer at > all. > > > >>>>>> Similarly, for sink connectors, the implementation may not have > > any > > > >>>>> output > > > >>>>>> buffer, or the third party library does not expose such buffer. > > > >>>>>> > > > >>>>>> 2. Although both type of metrics can detect bottleneck, > time-based > > > >>>>> metrics > > > >>>>>> can be used to generate a more informed action to remove the > > > >> bottleneck. > > > >>>>>> For example, when the downstream is bottleneck, the output > buffer > > > >> usage > > > >>>>>> metric is likely to be 100%, and the input buffer usage might be > > 0%. > > > >>>>> That > > > >>>>>> means we don't know what is the suitable parallelism to lift the > > > >>>>>> bottleneck. The time-based metric, on the other hand, would give > > > >> useful > > > >>>>>> information, e.g. if 80% of time was spent on emitting records, > we > > > can > > > >>>>>> roughly increase the downstream node parallelism by 4 times. > > > >>>>>> > > > >>>>>> Admittedly, the time-based metrics are more expensive than > buffer > > > >>>>> usage. 
So > > > >>>>>> we may have to do some sampling to reduce the cost. But in > > general, > > > >>>>> using > > > >>>>>> time-based metrics seems worth adding. > > > >>>>>> > > > >>>>>> That being said, I don't think buffer usage metric and > time-based > > > >>>>> metrics > > > >>>>>> are mutually exclusive. We can probably have both. It is just > that > > > in > > > >>>>>> practice, features like auto-scaling might prefer time-based > > metrics > > > >> for > > > >>>>>> the reason stated above. > > > >>>>>> > > > >>>>>>> 1. Define the metrics that would allow us to manually detect > > > >>>>> bottlenecks. > > > >>>>>> As I wrote, we already have them in most of the places, except > of > > > >>>>>> sources/sinks. > > > >>>>>>> 2. Use those metrics, to automatically detect bottlenecks. > > > Currently > > > >> we > > > >>>>>> are only automatically detecting back pressure and reporting it > to > > > the > > > >>>>> user > > > >>>>>> in web UI (is it exposed as a metric at all?). Detecting the > root > > > >> cause > > > >>>>> of > > > >>>>>> the back pressure (bottleneck) is one step further. > > > >>>>>>> 3. Use the knowledge about where exactly the bottleneck is > > located, > > > >> to > > > >>>>>> try to do something with it. > > > >>>>>> > > > >>>>>> As explained above, I think time-based metric also addresses > item > > 1 > > > >> and > > > >>>>>> item 2. > > > >>>>>> > > > >>>>>> Any thoughts? > > > >>>>>> > > > >>>>>> Thanks, > > > >>>>>> > > > >>>>>> Jiangjie (Becket) Qin > > > >>>>>> > > > >>>>>> > > > >>>>>> > > > >>>>>> On Mon, Jun 3, 2019 at 4:14 PM Piotr Nowojski < > > pi...@ververica.com> > > > >>>>> wrote: > > > >>>>>> > > > >>>>>>> Hi again :) > > > >>>>>>> > > > >>>>>>>> - pending.bytes, Gauge > > > >>>>>>>> - pending.messages, Gauge > > > >>>>>>> > > > >>>>>>> > > > >>>>>>> +1 > > > >>>>>>> > > > >>>>>>> And true, instead of overloading one of the metric it is better > > > when > > > >>>>> user > > > >>>>>>> can choose to provide only one of them. 
> > > >>>>>>> > > > >>>>>>> Re 2: > > > >>>>>>> > > > >>>>>>>> If I understand correctly, this metric along with the pending > > > >> mesages > > > >>>>> / > > > >>>>>>>> bytes would answer the questions of: > > > >>>>>>> > > > >>>>>>>> - Does the connector consume fast enough? Lagging behind + > empty > > > >>>>> buffer > > > >>>>>>> = > > > >>>>>>>> cannot consume fast enough. > > > >>>>>>>> - Does the connector emit fast enough? Lagging behind + full > > > buffer > > > >> = > > > >>>>>>>> cannot emit fast enough, i.e. the Flink pipeline is slow. > > > >>>>>>> > > > >>>>>>> Yes, exactly. This can also be used to support decisions like > > > >> changing > > > >>>>> the > > > >>>>>>> parallelism of the sources and/or down stream operators. > > > >>>>>>> > > > >>>>>>> I’m not sure if I understand your proposal with time based > > > >>>>> measurements. > > > >>>>>>> Maybe I’m missing something, but I do not see how measuring > time > > > >> alone > > > >>>>>>> could answer the problem: where is the bottleneck. Time spent > on > > > the > > > >>>>>>> next/emit might be short or long (depending on how heavy to > > process > > > >> the > > > >>>>>>> record is) and the source can still be bottlenecked/back > > pressured > > > or > > > >>>>> not. > > > >>>>>>> Usually the easiest and the most reliable way how to detect > > > >>>>> bottlenecks is > > > >>>>>>> by checking usage of input & output buffers, since when input > > > buffer > > > >> is > > > >>>>>>> full while output buffer is empty, that’s the definition of a > > > >>>>> bottleneck. > > > >>>>>>> Also this is usually very easy and cheap to measure (it works > > > >>>>> effectively > > > >>>>>>> the same way as current’s Flink back pressure monitoring, but > > more > > > >>>>> cleanly, > > > >>>>>>> without probing thread’s stack traces). 
> > > >>>>>>> > > > >>>>>>> Also keep in mind that we are already using the buffer usage > > > metrics > > > >>>>> for > > > >>>>>>> detecting the bottlenecks in Flink’s internal network exchanges > > > >> (manual > > > >>>>>>> work). That’s the reason why I wanted to extend this to > > > >> sources/sinks, > > > >>>>>>> since they are currently our blind spot. > > > >>>>>>> > > > >>>>>>>> One feature we are currently working on to scale Flink > > > automatically > > > >>>>>>> relies > > > >>>>>>>> on some metrics answering the same question > > > >>>>>>> > > > >>>>>>> That would be very helpful feature. I think in order to achieve > > > that > > > >> we > > > >>>>>>> would need to: > > > >>>>>>> 1. Define the metrics that would allow us to manually detect > > > >>>>> bottlenecks. > > > >>>>>>> As I wrote, we already have them in most of the places, except > of > > > >>>>>>> sources/sinks. > > > >>>>>>> 2. Use those metrics, to automatically detect bottlenecks. > > > Currently > > > >> we > > > >>>>>>> are only automatically detecting back pressure and reporting it > > to > > > >> the > > > >>>>> user > > > >>>>>>> in web UI (is it exposed as a metric at all?). Detecting the > root > > > >>>>> cause of > > > >>>>>>> the back pressure (bottleneck) is one step further. > > > >>>>>>> 3. Use the knowledge about where exactly the bottleneck is > > located, > > > >> to > > > >>>>> try > > > >>>>>>> to do something with it. > > > >>>>>>> > > > >>>>>>> I think you are aiming for point 3., but before we reach it, we > > are > > > >>>>> still > > > >>>>>>> missing 1. & 2. Also even if we have 3., there is a value in 1 > & > > 2 > > > >> for > > > >>>>>>> manual analysis/dashboards. > > > >>>>>>> > > > >>>>>>> However, having the knowledge of where the bottleneck is, > doesn’t > > > >>>>>>> necessarily mean that we know what we can do about it. 
For > > example > > > >>>>>>> increasing parallelism might or might not help with anything > > (data > > > >>>>> skew, > > > >>>>>>> bottleneck on some resource that does not scale), but this > remark > > > >>>>> applies > > > >>>>>>> always, regardless of the way how did we detect the bottleneck. > > > >>>>>>> > > > >>>>>>> Piotrek > > > >>>>>>> > > > >>>>>>>> On 3 Jun 2019, at 06:16, Becket Qin <becket....@gmail.com> > > wrote: > > > >>>>>>>> > > > >>>>>>>> Hi Piotr, > > > >>>>>>>> > > > >>>>>>>> Thanks for the suggestion. Some thoughts below: > > > >>>>>>>> > > > >>>>>>>> Re 1: The pending messages / bytes. > > > >>>>>>>> I completely agree these are very useful metrics and we should > > > >> expect > > > >>>>> the > > > >>>>>>>> connector to report. WRT the way to expose them, it seems more > > > >>>>> consistent > > > >>>>>>>> to add two metrics instead of adding a method (unless there > are > > > >> other > > > >>>>> use > > > >>>>>>>> cases other than metric reporting). So we can add the > following > > > two > > > >>>>>>> metrics. > > > >>>>>>>> - pending.bytes, Gauge > > > >>>>>>>> - pending.messages, Gauge > > > >>>>>>>> Applicable connectors can choose to report them. These two > > metrics > > > >>>>> along > > > >>>>>>>> with latency should be sufficient for users to understand the > > > >> progress > > > >>>>>>> of a > > > >>>>>>>> connector. > > > >>>>>>>> > > > >>>>>>>> > > > >>>>>>>> Re 2: Number of buffered data in-memory of the connector > > > >>>>>>>> If I understand correctly, this metric along with the pending > > > >> mesages > > > >>>>> / > > > >>>>>>>> bytes would answer the questions of: > > > >>>>>>>> - Does the connector consume fast enough? Lagging behind + > empty > > > >>>>> buffer > > > >>>>>>> = > > > >>>>>>>> cannot consume fast enough. > > > >>>>>>>> - Does the connector emit fast enough? Lagging behind + full > > > buffer > > > >> = > > > >>>>>>>> cannot emit fast enough, i.e. the Flink pipeline is slow. 
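The two rules just quoted (lagging behind + empty buffer vs. lagging behind + full buffer) map to a simple classification. A sketch with invented threshold values; `lagging` would come from the pending messages/bytes gauges and `buffer_usage` from the connector's in-memory buffer:

```python
def diagnose(lagging, buffer_usage, full=0.8, empty=0.2):
    """Classify a source connector per the two rules quoted above.

    `lagging` is whether pending messages/bytes indicate a backlog;
    `buffer_usage` is the connector buffer fill level in [0.0, 1.0].
    Thresholds are arbitrary illustration values.
    """
    if not lagging:
        return "keeping up"
    if buffer_usage <= empty:
        return "cannot consume fast enough"  # external system / fetch path slow
    if buffer_usage >= full:
        return "cannot emit fast enough"  # the Flink pipeline is slow
    return "lagging, no single clear bottleneck"
```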
> > > >>>>>>>> > > > >>>>>>>> One feature we are currently working on to scale Flink > > > automatically > > > >>>>>>> relies > > > >>>>>>>> on some metrics answering the same question, more > specifically, > > we > > > >> are > > > >>>>>>>> profiling the time spent on .next() method (time to consume) > and > > > the > > > >>>>> time > > > >>>>>>>> spent on .collect() method (time to emit / process). One > > advantage > > > >> of > > > >>>>>>> such > > > >>>>>>>> method level time cost allows us to calculate the parallelism > > > >>>>> required to > > > >>>>>>>> keep up in case their is a lag. > > > >>>>>>>> > > > >>>>>>>> However, one concern I have regarding such metric is that they > > are > > > >>>>>>>> implementation specific. Either profiling on the method time, > or > > > >>>>>>> reporting > > > >>>>>>>> buffer usage assumes the connector are implemented in such a > > way. > > > A > > > >>>>>>>> slightly better solution might be have the following metric: > > > >>>>>>>> > > > >>>>>>>> - EmitTimeRatio (or FetchTimeRatio): The time spent on > emitting > > > >>>>>>>> records / Total time elapsed. > > > >>>>>>>> > > > >>>>>>>> This assumes that the source connectors have to emit the > records > > > to > > > >>>>> the > > > >>>>>>>> downstream at some point. The emission may take some time ( > e.g. > > > go > > > >>>>>>> through > > > >>>>>>>> chained operators). And the rest of the time are spent to > > prepare > > > >> the > > > >>>>>>>> record to emit, including time for consuming and format > > > conversion, > > > >>>>> etc. > > > >>>>>>>> Ideally, we'd like to see the time spent on record fetch and > > emit > > > to > > > >>>>> be > > > >>>>>>>> about the same, so no one is bottleneck for the other. > > > >>>>>>>> > > > >>>>>>>> The downside of these time based metrics is additional > overhead > > to > > > >> get > > > >>>>>>> the > > > >>>>>>>> time, therefore sampling might be needed. 
But in practice I > feel > > > >> such > > > >>>>>>> time > > > >>>>>>>> based metric might be more useful if we want to take action. > > > >>>>>>>> > > > >>>>>>>> > > > >>>>>>>> I think we should absolutely add metrics in (1) to the metric > > > >>>>> convention. > > > >>>>>>>> We could also add the metrics mentioned in (2) if we reach > > > consensus > > > >>>>> on > > > >>>>>>>> that. What do you think? > > > >>>>>>>> > > > >>>>>>>> Thanks, > > > >>>>>>>> > > > >>>>>>>> Jiangjie (Becket) Qin > > > >>>>>>>> > > > >>>>>>>> > > > >>>>>>>> On Fri, May 31, 2019 at 4:26 PM Piotr Nowojski < > > > pi...@ververica.com > > > >>> > > > >>>>>>> wrote: > > > >>>>>>>> > > > >>>>>>>>> Hey Becket, > > > >>>>>>>>> > > > >>>>>>>>> Re 1a) and 1b) +1 from my side. > > > >>>>>>>>> > > > >>>>>>>>> I’ve discussed this issue: > > > >>>>>>>>> > > > >>>>>>>>>>>> > > > >>>>>>>>>>>> 2. It would be nice to have metrics, that allow us to > check > > > the > > > >>>>> cause > > > >>>>>>>>> of > > > >>>>>>>>>>>> back pressure: > > > >>>>>>>>>>>> a) for sources, length of input queue (in bytes? Or > boolean > > > >>>>>>>>>>>> hasSomethingl/isEmpty) > > > >>>>>>>>>>>> b) for sinks, length of output queue (in bytes? Or boolean > > > >>>>>>>>>>>> hasSomething/isEmpty) > > > >>>>>>>>> > > > >>>>>>>>> With Nico at some lengths and he also saw the benefits of > them. > > > We > > > >>>>> also > > > >>>>>>>>> have more concrete proposal for that. > > > >>>>>>>>> > > > >>>>>>>>> Actually there are two really useful metrics, that we are > > missing > > > >>>>>>>>> currently: > > > >>>>>>>>> > > > >>>>>>>>> 1. Number of data/records/bytes in the backlog to process. > For > > > >>>>> example > > > >>>>>>>>> remaining number of bytes in unread files. Or pending data in > > > Kafka > > > >>>>>>> topics. > > > >>>>>>>>> 2. Number of buffered data in-memory of the connector, that > are > > > >>>>> waiting > > > >>>>>>> to > > > >>>>>>>>> be processed pushed to Flink pipeline. 
> > > >>>>>>>>> > > > >>>>>>>>> Re 1: > > > >>>>>>>>> This would have to be a metric provided directly by a > > connector. > > > It > > > >>>>>>> could > > > >>>>>>>>> be an undefined `int`: > > > >>>>>>>>> > > > >>>>>>>>> `int backlog` - estimate of pending work. > > > >>>>>>>>> > > > >>>>>>>>> “Undefined” meaning that it would be up to a connector to > > decided > > > >>>>>>> whether > > > >>>>>>>>> it’s measured in bytes, records, pending files or whatever it > > is > > > >>>>>>> possible > > > >>>>>>>>> to provide by the connector. This is because I assume not > every > > > >>>>>>> connector > > > >>>>>>>>> can provide exact number and for some of them it might be > > > >> impossible > > > >>>>> to > > > >>>>>>>>> provide records number of bytes count. > > > >>>>>>>>> > > > >>>>>>>>> Re 2: > > > >>>>>>>>> This metric could be either provided by a connector, or > > > calculated > > > >>>>>>> crudely > > > >>>>>>>>> by Flink: > > > >>>>>>>>> > > > >>>>>>>>> `float bufferUsage` - value from [0.0, 1.0] range > > > >>>>>>>>> > > > >>>>>>>>> Percentage of used in memory buffers, like in Kafka’s > handover. > > > >>>>>>>>> > > > >>>>>>>>> It could be crudely implemented by Flink with FLIP-27 > > > >>>>>>>>> SourceReader#isAvailable. If SourceReader is not available > > > reported > > > >>>>>>>>> `bufferUsage` could be 0.0. If it is available, it could be > > 1.0. > > > I > > > >>>>> think > > > >>>>>>>>> this would be a good enough estimation for most of the use > > cases > > > >>>>> (that > > > >>>>>>>>> could be overloaded and implemented better if desired). > > > Especially > > > >>>>>>> since we > > > >>>>>>>>> are reporting only probed values: if probed values are almost > > > >> always > > > >>>>>>> “1.0”, > > > >>>>>>>>> it would mean that we have a back pressure. If they are > almost > > > >> always > > > >>>>>>>>> “0.0”, there is probably no back pressure at the sources. > > > >>>>>>>>> > > > >>>>>>>>> What do you think about this? 
> > > >>>>>>>>> > > > >>>>>>>>> Piotrek > > > >>>>>>>>> > > > >>>>>>>>>> On 30 May 2019, at 11:41, Becket Qin <becket....@gmail.com> > > > >> wrote: > > > >>>>>>>>>> > > > >>>>>>>>>> Hi all, > > > >>>>>>>>>> > > > >>>>>>>>>> Thanks a lot for all the feedback and comments. I'd like to > > > >> continue > > > >>>>>>> the > > > >>>>>>>>>> discussion on this FLIP. > > > >>>>>>>>>> > > > >>>>>>>>>> I updated the FLIP-33 wiki to remove all the histogram > metrics > > > >> from > > > >>>>> the > > > >>>>>>>>>> first version of metric convention due to the performance > > > concern. > > > >>>>> The > > > >>>>>>>>> plan > > > >>>>>>>>>> is to introduce them later when we have a mechanism to opt > > > in/out > > > >>>>>>>>> metrics. > > > >>>>>>>>>> At that point, users can decide whether they want to pay the > > > cost > > > >> to > > > >>>>>>> get > > > >>>>>>>>>> the metric or not. > > > >>>>>>>>>> > > > >>>>>>>>>> As Stephan suggested, for this FLIP, let's first try to > agree > > on > > > >> the > > > >>>>>>>>> small > > > >>>>>>>>>> list of conventional metrics that connectors should follow. > > > >>>>>>>>>> Just to be clear, the purpose of the convention is not to > > > enforce > > > >>>>> every > > > >>>>>>>>>> connector to report all these metrics, but to provide a > > guidance > > > >> in > > > >>>>>>> case > > > >>>>>>>>>> these metrics are reported by some connectors. > > > >>>>>>>>>> > > > >>>>>>>>>> > > > >>>>>>>>>> @ Stephan & Chesnay, > > > >>>>>>>>>> > > > >>>>>>>>>> Regarding the duplication of `RecordsIn` metric in operator > / > > > task > > > >>>>>>>>>> IOMetricGroups, from what I understand, for source operator, > > it > > > is > > > >>>>>>>>> actually > > > >>>>>>>>>> the SourceFunction that reports the operator level > > > >>>>>>>>>> RecordsIn/RecordsInPerSecond metric. So they are essentially > > the > > > >>>>> same > > > >>>>>>>>>> metric in the operator level IOMetricGroup. 
Similarly for > the > > > Sink > > > >>>>>>>>>> operator, the operator level RecordsOut/RecordsOutPerSecond > > > >> metrics > > > >>>>> are > > > >>>>>>>>>> also reported by the Sink function. I marked them as > existing > > in > > > >> the > > > >>>>>>>>>> FLIP-33 wiki page. Please let me know if I misunderstood. > > > >>>>>>>>>> > > > >>>>>>>>>> Thanks, > > > >>>>>>>>>> > > > >>>>>>>>>> Jiangjie (Becket) Qin > > > >>>>>>>>>> > > > >>>>>>>>>> > > > >>>>>>>>>> On Thu, May 30, 2019 at 5:16 PM Becket Qin < > > > becket....@gmail.com> > > > >>>>>>> wrote: > > > >>>>>>>>>> > > > >>>>>>>>>>> Hi Piotr, > > > >>>>>>>>>>> > > > >>>>>>>>>>> Thanks a lot for the feedback. > > > >>>>>>>>>>> > > > >>>>>>>>>>> 1a) I guess you are referring to the part that "original > > system > > > >>>>>>> specific > > > >>>>>>>>>>> metrics should also be reported". The performance impact > > > depends > > > >> on > > > >>>>>>> the > > > >>>>>>>>>>> implementation. An efficient implementation would only > record > > > the > > > >>>>>>> metric > > > >>>>>>>>>>> once, but report them with two different metric names. This > > is > > > >>>>>>> unlikely > > > >>>>>>>>> to > > > >>>>>>>>>>> hurt performance. > > > >>>>>>>>>>> > > > >>>>>>>>>>> 1b) Yes, I agree that we should avoid adding overhead to > the > > > >>>>> critical > > > >>>>>>>>> path > > > >>>>>>>>>>> by all means. This is sometimes a tradeoff, running blindly > > > >> without > > > >>>>>>> any > > > >>>>>>>>>>> metric gives best performance, but sometimes might be > > > frustrating > > > >>>>> when > > > >>>>>>>>> we > > > >>>>>>>>>>> debug some issues. > > > >>>>>>>>>>> > > > >>>>>>>>>>> 2. The metrics are indeed very useful. Are they supposed to > > be > > > >>>>>>> reported > > > >>>>>>>>> by > > > >>>>>>>>>>> the connectors or Flink itself? At this point FLIP-33 is > more > > > >>>>> focused > > > >>>>>>> on > > > >>>>>>>>>>> provide a guidance to the connector authors on the metrics > > > >>>>> reporting. 
> > > >>>>>>>>> That > > > >>>>>>>>>>> said, after FLIP-27, I think we should absolutely report > > these > > > >>>>> metrics > > > >>>>>>>>> in > > > >>>>>>>>>>> the abstract implementation. In any case, the metric > > convention > > > >> in > > > >>>>>>> this > > > >>>>>>>>>>> list are expected to evolve over time. > > > >>>>>>>>>>> > > > >>>>>>>>>>> Thanks, > > > >>>>>>>>>>> > > > >>>>>>>>>>> Jiangjie (Becket) Qin > > > >>>>>>>>>>> > > > >>>>>>>>>>> On Tue, May 28, 2019 at 6:24 PM Piotr Nowojski < > > > >>>>> pi...@ververica.com> > > > >>>>>>>>>>> wrote: > > > >>>>>>>>>>> > > > >>>>>>>>>>>> Hi, > > > >>>>>>>>>>>> > > > >>>>>>>>>>>> Thanks for the proposal and driving the effort here Becket > > :) > > > >> I’ve > > > >>>>>>> read > > > >>>>>>>>>>>> through the FLIP-33 [1], and here are couple of my > thoughts. > > > >>>>>>>>>>>> > > > >>>>>>>>>>>> Big +1 for standardising the metric names between > > connectors, > > > it > > > >>>>> will > > > >>>>>>>>>>>> definitely help us and users a lot. > > > >>>>>>>>>>>> > > > >>>>>>>>>>>> Issues/questions/things to discuss that I’ve thought of: > > > >>>>>>>>>>>> > > > >>>>>>>>>>>> 1a. If we are about to duplicate some metrics, can this > > > become a > > > >>>>>>>>>>>> performance issue? > > > >>>>>>>>>>>> 1b. Generally speaking, we should make sure that > collecting > > > >> those > > > >>>>>>>>> metrics > > > >>>>>>>>>>>> is as non intrusive as possible, especially that they will > > > need > > > >>>>> to be > > > >>>>>>>>>>>> updated once per record. (They might be collected more > > rarely > > > >> with > > > >>>>>>> some > > > >>>>>>>>>>>> overhead, but the hot path of updating it per record will > > need > > > >> to > > > >>>>> be > > > >>>>>>> as > > > >>>>>>>>>>>> quick as possible). 
That includes both avoiding heavy > > > >> computation > > > >>>>> on > > > >>>>>>>>> per > > > >>>>>>>>>>>> record path: histograms?, measuring time for time based > > > metrics > > > >>>>> (per > > > >>>>>>>>>>>> second) (System.currentTimeMillis() depending on the > > > >>>>> implementation > > > >>>>>>> can > > > >>>>>>>>>>>> invoke a system call) > > > >>>>>>>>>>>> > > > >>>>>>>>>>>> 2. It would be nice to have metrics, that allow us to > check > > > the > > > >>>>> cause > > > >>>>>>>>> of > > > >>>>>>>>>>>> back pressure: > > > >>>>>>>>>>>> a) for sources, length of input queue (in bytes? Or > boolean > > > >>>>>>>>>>>> hasSomethingl/isEmpty) > > > >>>>>>>>>>>> b) for sinks, length of output queue (in bytes? Or boolean > > > >>>>>>>>>>>> hasSomething/isEmpty) > > > >>>>>>>>>>>> > > > >>>>>>>>>>>> a) is useful in a scenario when we are processing backlog > of > > > >>>>> records, > > > >>>>>>>>> all > > > >>>>>>>>>>>> of the internal Flink’s input/output network buffers are > > > empty, > > > >>>>> and > > > >>>>>>> we > > > >>>>>>>>> want > > > >>>>>>>>>>>> to check whether the external source system is the > > bottleneck > > > >>>>>>> (source’s > > > >>>>>>>>>>>> input queue will be empty), or if the Flink’s connector is > > the > > > >>>>>>>>> bottleneck > > > >>>>>>>>>>>> (source’s input queues will be full). > > > >>>>>>>>>>>> b) similar story. 
Backlog of records, but this time all of the internal Flink input/output network buffers are full, and we want to check whether the external sink system is the bottleneck (the sink's output queues are full) or the Flink connector is the bottleneck (the sink's output queues will be empty).

It might sometimes be difficult to provide those metrics, so they could be optional, but if we could provide them, it would be really helpful.

Piotrek

[1] https://cwiki.apache.org/confluence/display/FLINK/FLIP-33:+Standardize+Connector+Metrics

On 24 Apr 2019, at 13:28, Stephan Ewen <se...@apache.org> wrote:

I think this sounds reasonable.

Let's keep the "reconfiguration without stopping the job" out of this, because that would be a super big effort, and if we approach that, then in a more generic way rather than specific to connector metrics.
I would suggest looking at the following things before starting with any implementation work:

- Try and find a committer to support this, otherwise it will be hard to make progress.
- Start with defining a smaller set of "core metrics" and extend the set later. I think that is easier than blocking now on reaching consensus on a large group of metrics.
- Find a solution to the problem Chesnay mentioned, that the "records in" metric is somewhat overloaded and already exists in the IO metric group.

On Mon, Mar 25, 2019 at 7:16 AM Becket Qin <becket....@gmail.com> wrote:

Hi Stephan,

Thanks a lot for the feedback. All makes sense.

It is a good suggestion to simply have an onRecord(numBytes, eventTime) method for connector writers. It should meet most of the requirements.

The configurable metrics feature is something really useful, especially if we can somehow make it dynamically configurable without stopping the jobs.
It might be better to make that a separate discussion because it is a more generic feature instead of one only for connectors.

So in order to make some progress, in this FLIP we can limit the discussion scope to the connector-related items:

- the standard connector metric names and types
- the abstract ConnectorMetricHandler interface

I'll start a separate thread to discuss other general metric-related enhancement items, including:

- optional metrics
- dynamic metric configuration
- potential combination with a rate limiter

Does this plan sound reasonable?

Thanks,

Jiangjie (Becket) Qin

On Sat, Mar 23, 2019 at 5:53 AM Stephan Ewen <se...@apache.org> wrote:

Ignoring implementation details for a moment, this connector metrics work is a really good thing to do, in my opinion.

The question "oh, my job seems to be doing nothing, I am looking at the UI and the 'records in' value is still zero" is in the top three support questions I have been asked personally.
Introspection into "how far is the consumer lagging behind" (event-time fetch latency) came up many times as well.

So a big +1 to solving this problem.

About the exact design, I would try to go for the following properties:

- Keep complexity out of connectors. Ideally the metrics handler has a single onRecord(numBytes, eventTime) method or so, and everything else is internal to the handler. That makes it dead simple for the connector. We can also think of an extension scheme for connector-specific metrics.
- Make it configurable on the job or cluster level which metrics the handler internally creates when that method is invoked.

What do you think?

Best,
Stephan

On Thu, Mar 21, 2019 at 10:42 AM Chesnay Schepler <ches...@apache.org> wrote:

As I said before, I believe this to be over-engineered and have no interest in this implementation.

There are conceptual issues, like defining a duplicate numBytesIn(PerSec) metric that already exists for each operator.
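As a rough illustration of Stephan's "single onRecord(numBytes, eventTime)" idea above, a handler could look like the following sketch. None of these class or method names are actual Flink interfaces; they are hypothetical stand-ins that only show the shape of the design.

```java
// Hypothetical sketch of the "single onRecord(numBytes, eventTime)" handler idea.
// None of these names are actual Flink API; they only illustrate the shape.
class ConnectorMetricHandlerSketch {
    private final boolean trackEventTimeLag; // enabled per job/cluster configuration
    private long numRecordsIn = 0;
    private long numBytesIn = 0;
    private long currentEventTimeLagMs = 0;

    ConnectorMetricHandlerSketch(boolean trackEventTimeLag) {
        this.trackEventTimeLag = trackEventTimeLag;
    }

    // The single method a connector calls per record; which metrics are
    // actually maintained stays internal to the handler.
    void onRecord(long numBytes, long eventTimeMs) {
        numRecordsIn++;
        numBytesIn += numBytes;
        if (trackEventTimeLag) {
            // Note Piotr's cost concern above: a real implementation might
            // sample the clock only every N records rather than on each one,
            // since System.currentTimeMillis() may involve a system call.
            currentEventTimeLagMs = System.currentTimeMillis() - eventTimeMs;
        }
    }

    long getNumRecordsIn() { return numRecordsIn; }
    long getNumBytesIn() { return numBytesIn; }
    long getCurrentEventTimeLagMs() { return currentEventTimeLagMs; }
}
```

The connector's hot path stays a method call plus two counter increments, while the job- or cluster-level configuration decides which optional metrics the handler maintains internally.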
On 21.03.2019 06:13, Becket Qin wrote:

A few updates to the thread. I uploaded a patch [1] as a complete example of how users can use the metrics in the future.

Some thoughts below after taking a look at AbstractMetricGroup and its subclasses.

This patch intends to provide convenience for Flink connector implementations to follow the metrics standards proposed in FLIP-33. It also tries to enhance metric management in a general way to help users with:

1. metric definition
2. metric dependency checks
3. metric validation
4. metric control (turning particular metrics on/off)

This patch wraps MetricGroup to extend the functionality of AbstractMetricGroup and its subclasses. AbstractMetricGroup mainly focuses on the metric group hierarchy, but does not really manage the metrics other than keeping them in a Map.

Ideally we should only have one entry point for the metrics.

Right now the entry point is AbstractMetricGroup.
However, besides the missing functionality mentioned above, AbstractMetricGroup seems deeply rooted in the Flink runtime. We could extract it out to flink-metrics in order to use it for generic purposes. There will be some work, though.

Another approach is to make the AbstractMetrics in this patch the metric entry point. It wraps the metric group and provides the missing functionality. Then we can roll this pattern out to runtime components gradually as well.

My first thought is that the latter approach gives a smoother migration. But I am also OK with refactoring the AbstractMetricGroup family.

Thanks,

Jiangjie (Becket) Qin

[1] https://github.com/becketqin/flink/pull/1

On Mon, Feb 25, 2019 at 2:32 PM Becket Qin <becket....@gmail.com> wrote:

Hi Chesnay,

It might be easier to discuss some implementation details in the PR review instead of in the FLIP discussion thread. I have a patch for the Kafka connectors ready but haven't submitted the PR yet.
Hopefully that will help explain a bit more.

Re: metric type binding

This is a valid point that is worth discussing. If I understand correctly, there are two points:

1. The metric type / interface does not matter as long as the metric semantics are clearly defined.
Conceptually speaking, I agree that as long as the metric semantics are defined, the metric type does not matter. To some extent, Gauge / Counter / Meter / Histogram themselves can be thought of as well-recognized semantics, if you wish. In Flink, these metric semantics have their associated interface classes. In practice, such a semantic-to-interface binding seems necessary for different components to communicate. Simply standardizing the semantics of the connector metrics seems insufficient for people to build an ecosystem on top of. At the end of the day, we still need some embodiment of the metric semantics that people can program against.

2. Sometimes the same metric semantics can be exposed using different metric types / interfaces.
This is a good point. Counter and Gauge-as-a-Counter are pretty much interchangeable.
This is more of a trade-off between the user experience of metric producers and that of metric consumers. The metric producers want to use Counter or Gauge depending on whether the count is already tracked in code, while ideally the metric consumers only want to see a single metric type for each metric. I am leaning towards making the metric producers happy, i.e. allowing either the Gauge or the Counter metric type and having the metric consumers handle the type variation. The reason is that in practice there might be more connector implementations than metric reporter implementations. We could also provide some helper method to facilitate reading from such a variable metric type.

Just some quick replies to the comments around implementation details.

> 4) single place where metrics are registered except connector-specific ones (which we can't really avoid).

Registering connector-specific ones in a single place is actually something that I want to achieve.

> 2) I'm talking about time-series databases like Prometheus.
> We would only have a gauge metric exposing the last fetchTime/emitTime that is regularly reported to the backend (Prometheus), where a user could build a histogram of his choosing when/if he wants it.

Not sure if such downsampling works. As an example, suppose a user complains that there are some intermittent latency spikes (maybe a few records in 10 seconds) in their processing system. Having a Gauge sampling instantaneous latency is unlikely to be useful, but looking at the actual 99.9th percentile latency might help.

Thanks,

Jiangjie (Becket) Qin

On Fri, Feb 22, 2019 at 9:30 PM Chesnay Schepler <ches...@apache.org> wrote:

Re: over-complication of the implementation.

I think I now understand better what you're shooting for, effectively something like the OperatorIOMetricGroup. But still, re-define setupConnectorMetrics() to accept a set of flags for counters/meters (and _possibly_ histograms) along with a set of well-defined Optional<Gauge<?>>, and return the group.
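Fleshed out, the static-utility shape Chesnay describes just above might look like the following sketch. The Gauge and MetricGroup types here are minimal stand-ins, not the real org.apache.flink.metrics interfaces, and setupConnectorMetrics itself is a hypothetical method from this discussion, not an existing Flink API.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Minimal stand-ins for the metric types; the real org.apache.flink.metrics
// interfaces differ. This only sketches the shape of the proposed utility.
interface Gauge<T> { T getValue(); }

class MetricGroup {
    private final Map<String, Object> metrics = new HashMap<>();
    MetricGroup addGroup(String name) { return this; } // simplified: no nesting
    void gauge(String name, Gauge<?> gauge) { metrics.put(name, gauge); }
    Object getMetric(String name) { return metrics.get(name); }
}

class ConnectorMetricsUtil {
    // Registers the well-known metrics the connector opted into under a
    // standardized scope, and returns the group so connector-specific metrics
    // can still be added to it afterwards.
    static MetricGroup setupConnectorMetrics(
            MetricGroup operatorMetricGroup,
            String connectorName,
            Optional<Gauge<Long>> numRecordsIn,
            Optional<Gauge<Long>> numBytesIn) {
        MetricGroup group = operatorMetricGroup.addGroup(connectorName);
        numRecordsIn.ifPresent(g -> group.gauge("numRecordsIn", g));
        numBytesIn.ifPresent(g -> group.gauge("numBytesIn", g));
        return group;
    }
}
```

A connector would pass an Optional for each standard metric it can supply, get standardized scoping and naming for free, and register any extra connector-specific metrics on the returned group.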
Solves all issues as far as I can tell:
1) no metrics must be created manually (except Gauges, which are effectively just Suppliers, and you can't get around this),
2) additional metrics can be registered on the returned group,
3) see 1),
4) single place where metrics are registered except connector-specific ones (which we can't really avoid).

Re: Histogram

1) As an example, whether "numRecordsIn" is exposed as a Counter or a Gauge should be irrelevant. So far we're using the metric type that is the most convenient at exposing a given value. If there is some backing data structure that we want to expose data from, we typically opt for a Gauge, as otherwise we're just mucking around with the Meter/Counter API to get it to match. Similarly, if we wanted to count something but no current count existed, we typically added a Counter. That's why attaching semantics to metric types makes little sense (but unfortunately several reporters already do it); for counters/meters certainly, but the majority of metrics are gauges.
2) I'm talking about time-series databases like Prometheus. We would only have a gauge metric exposing the last fetchTime/emitTime that is regularly reported to the backend (Prometheus), where a user could build a histogram of his choosing when/if he wants it.

On 22.02.2019 13:57, Becket Qin wrote:

Hi Chesnay,

Thanks for the explanation.

Re: FLIP

I might have misunderstood this, but it seems that "major changes" are well defined for FLIPs. The full content is the following:

What is considered a "major change" that needs a FLIP?

Any of the following should be considered a major change:

- Any major new feature, subsystem, or piece of functionality
- *Any change that impacts the public interfaces of the project*

What are the "public interfaces" of the project?
*All of the following are public interfaces* that people build around:

- DataStream and DataSet API, including classes related to them, such as StreamExecutionEnvironment
- Classes marked with the @Public annotation
- On-disk binary formats, such as checkpoints/savepoints
- User-facing scripts/command-line tools, i.e. bin/flink, Yarn scripts, Mesos scripts
- Configuration settings
- *Exposed monitoring information*

So any monitoring information change is considered a public interface change, and any public interface change is considered a "major change".

Re: over-complication of the implementation.

Although this is more of an implementation detail that is not covered by the FLIP, it may be worth discussing.

First of all, I completely agree that we should use the simplest way to achieve our goal.
To me the goal is the following:
1. Clear connector conventions and interfaces.
2. Ease of creating a connector.

Both are important to the prosperity of the connector ecosystem. So I'd rather abstract as much as possible on our side to make the connector developer's work lighter. Given this goal, a static util method approach might have a few drawbacks:
1. Users still have to construct the metrics by themselves. (And note that this might be error-prone by itself. For example, a custom wrapper around a Dropwizard meter may be used instead of MeterView.)
2. When connector-specific metrics are added, it is difficult to enforce that their scope is the same as that of the standard metrics.
3. It seems that method proliferation is inevitable if we want to apply sanity checks, e.g. that the numBytesIn metric was not registered as a meter.
4. Metrics are still defined in random places and are hard to track.

The current PR I have was inspired by the Config system in Kafka, which I found pretty handy. In fact it is used not only by Kafka itself but even by some other projects that depend on Kafka. I am not saying this approach is perfect.
But I think it is worth it to save connector writers the work and to encourage more systematic implementations. That being said, I am fully open to suggestions.

Re: Histogram

I think there are two orthogonal questions around those metrics:

1. Regardless of the metric type, by just looking at the meaning of a metric, is it generic to all connectors? If the answer is yes, we should include the metric in the convention. Whether we include it in the convention or not, some connector implementations will emit such a metric. It is better to have a convention than to let each connector do random things.

2. If a standard metric is a histogram, what should we do?
I agree that we should make it clear that using histograms carries a performance risk. But I do see histograms as useful in some fine-granularity debugging where one does not have the luxury of stopping the system and injecting more inspection code. So the workaround I am thinking of is to provide some implementation suggestions.
Assume later on we have a mechanism for selective metrics. In the abstract metrics class we can disable those metrics by default, so individual connector writers do not have to do anything (this is another advantage of having an AbstractMetrics instead of static util methods).

I am not sure I fully understand the histogram-in-the-backend approach. Can you explain a bit more? Do you mean emitting the raw data, e.g. fetchTime and emitTime, with each record and letting the histogram computation happen in the background? Or letting the processing thread put the values into a queue and having a separate thread poll from the queue and add them into the histogram?

Thanks,

Jiangjie (Becket) Qin

On Fri, Feb 22, 2019 at 4:34 PM Chesnay Schepler <ches...@apache.org> wrote:

Re: FLIP
The very first line under both the main header and the Purpose section describes FLIPs as "major changes", which this isn't.
Re: complication
I'm not arguing against standardization, but again against an over-complicated implementation when a static utility method would be sufficient:

    public static void setupConnectorMetrics(
        MetricGroup operatorMetricGroup,
        String connectorName,
        Optional<Gauge<Long>> numRecordsIn,
        ...)

This gives you all you need:
* a well-defined set of metrics for a connector to opt into
* standardized naming schemes for the scope and the individual metrics
* standardized metric types (although personally I'm not interested in that, since metric types should be considered syntactic sugar)

Re: configurable histograms
If anything, they _must_ be turned off by default, but the metric system is already exposing so many options that I'm not too keen on adding even more. You have also only addressed my first argument against histograms (performance); the second one still stands (calculate histograms in metric backends instead).

On 21.02.2019 16:27, Becket Qin wrote:

Hi Chesnay,

Thanks for the comments.
I think this is worthy of a FLIP because it is public API. According to the FLIP description, a FLIP is required in case of:

- Any change that impacts the public interfaces of the project

and the following entry is found in the definition of "public interface":

- Exposed monitoring information

Metrics are critical to any production system, so a clear metric definition is important for any serious user. For an organization with a large Flink installation, a change in metrics means a great amount of work. So such changes do need to be fully discussed and documented.

Re: Histogram
We can discuss whether there is a better way to expose metrics that are suitable for histograms. My micro-benchmark of various histogram implementations also indicates that they are significantly slower than other metric types. But I don't think that means we should never use histograms; it means we should use them with caution.
For example, we can suggest that implementations turn them off by default and only turn them on for a short amount of time when performing some micro-debugging.

Re: complication
Connector conventions are essential for the Flink ecosystem. The connector pool is probably the most important part of Flink, just as in any other data system. Clear conventions for connectors will help the Flink ecosystem grow in a more organic way.
Take the metrics convention as an example: imagine someone has developed a Flink connector for system Foo, and another developer has developed a monitoring and diagnostics framework for Flink which analyzes Flink job performance based on metrics. With a clear metric convention, those two projects could be developed independently. Once users put them together, they would work without additional modifications. This cannot be easily achieved by just defining a few constants.

Re: selective metrics
Sure, we can discuss that in a separate thread.
@Dawid

** Re: latency / fetchedLatency
The primary purpose of establishing such a convention is to help developers write connectors in a more compatible way. The convention is supposed to be defined proactively, so when looking at it, what matters most is whether the concept is applicable to connectors in general. It might be true that so far only the Kafka connector reports latency, but there might be hundreds of other connector implementations in the Flink ecosystem, though not in the Flink repo, and some of them also emit latency. I think a lot of other sources actually also have an append timestamp, e.g. database bin logs and some K-V stores. So I wouldn't be surprised if some database connector could also emit latency metrics.
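A minimal sketch of how a source whose records carry an append timestamp (e.g. a database bin log) could derive the two latency metrics under discussion. Method and parameter names here are hypothetical, not an actual Flink API.

```java
// Hypothetical sketch: deriving fetch latency and end-to-end source latency
// from a record's append timestamp in the external system.
public class SourceLatency {

    /** Time from the record being appended in the external system
     *  until the connector fetched it. */
    public static long fetchLatencyMs(long appendTimestampMs, long fetchTimeMs) {
        return fetchTimeMs - appendTimestampMs;
    }

    /** Time from the record being appended until the source emits it
     *  downstream, i.e. fetch latency plus time buffered in the connector. */
    public static long latencyMs(long appendTimestampMs, long emitTimeMs) {
        return emitTimeMs - appendTimestampMs;
    }
}
```

A connector could expose these values as gauges, updated per fetched record.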
Thanks,

Jiangjie (Becket) Qin

On Thu, Feb 21, 2019 at 10:14 PM Chesnay Schepler <ches...@apache.org> wrote:

Regarding 2) It doesn't make sense to investigate this as part of this FLIP. This is something that could be of interest for the entire metric system, and should be designed as such.

Regarding the proposal as a whole:

Histogram metrics shall not be added to the core of Flink. They are significantly more expensive than other metrics, and calculating histograms in the application is regarded as an anti-pattern by several metric backends, which instead recommend exposing the raw data and calculating the histogram in the backend.

Second, this seems overly complicated. Given that we already established that not all connectors will export all metrics, we are effectively reducing this down to a consistent naming scheme.
We don't need anything sophisticated for that; basically just a few constants that all connectors use.

I'm not convinced that this is worthy of a FLIP.

On 21.02.2019 14:26, Dawid Wysakowicz wrote:

Hi,

Ad 1. In general I understand and I agree. But those particular metrics (latency, fetchLatency) would right now only be reported if the user uses the KafkaConsumer with an internal timestampAssigner and StreamCharacteristic set to EventTime, right? That sounds like a very specific case. I am not sure if we should introduce a generic metric that will be disabled/absent for most implementations.

Ad 2. That sounds like an orthogonal issue that might make sense to investigate in the future.

Best,

Dawid

On 21/02/2019 13:20, Becket Qin wrote:

Hi Dawid,

Thanks for the feedback. That makes sense to me. There are two cases to be addressed.
1. The metrics are supposed to be a guidance. It is likely that a connector only supports some but not all of the metrics. In that case, each connector implementation should have the freedom to decide which metrics are reported. For the metrics that are supported, the guidance should be followed.

2. Sometimes users may want to disable certain metrics for some reason (e.g. performance, or reprocessing of data). A generic mechanism should be provided to allow users to choose which metrics are reported. This mechanism should also be honored by the connector implementations.

Does this sound reasonable to you?
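The generic mechanism in point 2 could be as simple as a user-provided set of disabled metric names that connector implementations consult before registering each metric. The sketch below is hypothetical; class and method names are illustrative, not an actual Flink API.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch of a selective-metrics mechanism: users list metric
// names to disable, and connectors check the set before registering a metric.
public class MetricSelector {
    private final Set<String> disabled;

    public MetricSelector(Set<String> disabledMetricNames) {
        this.disabled = new HashSet<>(disabledMetricNames);
    }

    /** Connectors call this before registering a metric of the given name. */
    public boolean isEnabled(String metricName) {
        return !disabled.contains(metricName);
    }
}
```

This would also cover Dawid's earlier concern: a user could simply disable latency metrics when replaying a backlog.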
Thanks,

Jiangjie (Becket) Qin

On Thu, Feb 21, 2019 at 4:22 PM Dawid Wysakowicz <dwysakow...@apache.org> wrote:

Hi,

Generally I like the idea of having a unified, standard set of metrics for all connectors. I have some slight concerns about fetchLatency and latency, though. They are computed based on event time, which is not a purely technical feature: it often depends on business logic, and might be absent or defined after the source. Those metrics could also behave in a weird way in case of replaying a backlog. Therefore I am not sure if we should include those metrics by default. Maybe we could at least introduce a feature switch for them? What do you think?

Best,

Dawid

On 21/02/2019 03:13, Becket Qin wrote:

Bump.
If there are no objections to the proposed metrics, I'll start a voting thread later today.

Thanks,

Jiangjie (Becket) Qin

On Mon, Feb 11, 2019 at 8:17 PM Becket Qin <becket....@gmail.com> wrote:

Hi folks,

I would like to start the FLIP discussion thread about standardizing the connector metrics.

In short, we would like to provide a convention for Flink connector metrics. It will help simplify the monitoring of and alerting on Flink jobs.
The FLIP link is the following:

https://cwiki.apache.org/confluence/display/FLINK/FLIP-33%3A+Standardize+Connector+Metrics

Thanks,

Jiangjie (Becket) Qin

--
Konstantin Knauf
https://twitter.com/snntrable
https://github.com/knaufk