> we will need to revisit the convention list and adjust them accordingly when FLIP-27 is ready
Yes, this sounds good :) Piotrek > On 13 Jun 2019, at 13:35, Becket Qin <becket....@gmail.com> wrote: > > Hi Piotr, > > That's great to know. Chances are that we will need to revisit the > convention list and adjust them accordingly when FLIP-27 is ready, At that > point we can mark some of the metrics as available by default for > connectors implementing the new interface. > > Thanks, > > Jiangjie (Becket) Qin > > On Thu, Jun 13, 2019 at 3:51 PM Piotr Nowojski <pi...@ververica.com> wrote: > >> Thanks for driving this. I’ve just noticed one small thing. With new >> SourceReader interface Flink will be able to provide `idleTime` metric >> automatically. >> >> Piotrek >> >>> On 13 Jun 2019, at 03:30, Becket Qin <becket....@gmail.com> wrote: >>> >>> Thanks all for the feedback and discussion. >>> >>> Since there wasn't any concern raised, I've started the voting thread for >>> this FLIP, but please feel free to continue the discussion here if you >>> think something still needs to be addressed. >>> >>> Thanks, >>> >>> Jiangjie (Becket) Qin >>> >>> >>> >>> On Mon, Jun 10, 2019 at 9:10 AM Becket Qin <becket....@gmail.com> wrote: >>> >>>> Hi Piotr, >>>> >>>> Thanks for the comments. Yes, you are right. Users will have to look at >>>> other metrics to decide whether the pipeline is healthy or not in the >> first >>>> place before they can use the time-based metric to fix the bottleneck. >>>> >>>> I agree that once we have FLIP-27 ready, some of the metrics can just be >>>> reported by the abstract implementation. >>>> >>>> I've updated FLIP-33 wiki page to add the pendingBytes and >> pendingRecords >>>> metric. Please let me know if you have any concern over the updated >> metric >>>> convention proposal. >>>> >>>> @Chesnay Schepler <ches...@apache.org> @Stephan Ewen >>>> <step...@ververica.com> will you also have time to take a look at the >>>> proposed metric convention? If there is no further concern I'll start a >>>> voting thread for this FLIP in two days. >>>> >>>> Thanks, >>>> >>>> Jiangjie (Becket) Qin >>>> >>>> >>>> >>>> On Wed, Jun 5, 2019 at 6:54 PM Piotr Nowojski <pi...@ververica.com> >> wrote: >>>> >>>>> Hi Becket, >>>>> >>>>> Thanks for the answer :) >>>>> >>>>>> By time-based metric, I meant the portion of time spent on producing >> the >>>>>> record to downstream. For example, a source connector can report that >>>>> it's >>>>>> spending 80% of time to emit record to downstream processing pipeline. >>>>> In >>>>>> another case, a sink connector may report that its spending 30% of >> time >>>>>> producing the records to the external system. >>>>>> >>>>>> This is in some sense equivalent to the buffer usage metric: >>>>> >>>>>> - 80% of time spent on emitting records to downstream ---> downstream >>>>>> node is bottleneck ---> output buffer is probably full. >>>>>> - 30% of time spent on emitting records to downstream ---> downstream >>>>>> node is not bottleneck ---> output buffer is probably not full. >>>>> >>>>> If by “time spent on emitting records to downstream” you understand >>>>> “waiting on back pressure”, then I see your point. And I agree that >> some >>>>> kind of ratio/time based metric gives you more information. However >> under >>>>> “time spent on emitting records to downstream” might be hidden the >>>>> following (extreme) situation: >>>>> >>>>> 1. 
Job is barely able to handle influx of records, there is 99% >>>>> CPU/resource usage in the cluster, but nobody is >>>>> bottlenecked/backpressured, all output buffers are empty, everybody is >>>>> waiting in 1% of it’s time for more records to process. >>>>> 2. 80% time can still be spent on "down stream operators”, because they >>>>> are the CPU intensive operations, but this doesn’t mean that >> increasing the >>>>> parallelism down the stream will help with anything there. To the >> contrary, >>>>> increasing parallelism of the source operator might help to increase >>>>> resource utilisation up to 100%. >>>>> >>>>> However, this “time based/ratio” approach can be extended to in/output >>>>> buffer usage. Besides collecting an information that input/output >> buffer is >>>>> full/empty, we can probe profile how often are buffer empty/full. If >> output >>>>> buffer is full 1% of times, there is almost no back pressure. If it’s >> full >>>>> 80% of times, there is some back pressure, if it’s full 99.9% of times, >>>>> there is huge back pressure. >>>>> >>>>> Now for autoscaling you could compare the input & output buffers fill >>>>> ratio: >>>>> >>>>> 1. Both are high, the source of bottleneck is down the stream >>>>> 2. Output is low, input is high, this is the bottleneck and the higher >>>>> the difference, the bigger source of bottleneck is this is >> operator/task >>>>> 3. Output is high, input is low - there was some load spike that we are >>>>> currently finishing to process >>>>> >>>>> >>>>> >>>>> But long story short, we are probably diverging from the topic of this >>>>> discussion, and we can discuss this at some later point. >>>>> >>>>> For now, for sources: >>>>> >>>>> as I wrote before, +1 for: >>>>> - pending.bytes, Gauge >>>>> - pending.messages, Gauge >>>>> >>>>> When we will be developing/discussing SourceReader from FLIP-27 we >> might >>>>> then add: >>>>> >>>>> - in-memory.buffer.usage (0 - 100%) >>>>> >>>>> Which will be estimated automatically by Flink while user will be able >> to >>>>> override/provide better estimation. >>>>> >>>>> Piotrek >>>>> >>>>>> On 5 Jun 2019, at 05:42, Becket Qin <becket....@gmail.com> wrote: >>>>>> >>>>>> Hi Piotr, >>>>>> >>>>>> Thanks for the explanation. Please see some clarifications below. >>>>>> >>>>>> By time-based metric, I meant the portion of time spent on producing >> the >>>>>> record to downstream. For example, a source connector can report that >>>>> it's >>>>>> spending 80% of time to emit record to downstream processing pipeline. >>>>> In >>>>>> another case, a sink connector may report that its spending 30% of >> time >>>>>> producing the records to the external system. >>>>>> >>>>>> This is in some sense equivalent to the buffer usage metric: >>>>>> - 80% of time spent on emitting records to downstream ---> downstream >>>>>> node is bottleneck ---> output buffer is probably full. >>>>>> - 30% of time spent on emitting records to downstream ---> downstream >>>>>> node is not bottleneck ---> output buffer is probably not full. >>>>>> >>>>>> However, the time-based metric has a few advantages that the buffer >>>>> usage >>>>>> metric may not have. >>>>>> >>>>>> 1. Buffer usage metric may not be applicable to all the connector >>>>>> implementations, while reporting time-based metric are always doable. >>>>>> Some source connectors may not have any input buffer, or they may use >>>>> some >>>>>> third party library that does not expose the input buffer at all. 
>>>>>> Similarly, for sink connectors, the implementation may not have any >>>>> output >>>>>> buffer, or the third party library does not expose such buffer. >>>>>> >>>>>> 2. Although both type of metrics can detect bottleneck, time-based >>>>> metrics >>>>>> can be used to generate a more informed action to remove the >> bottleneck. >>>>>> For example, when the downstream is bottleneck, the output buffer >> usage >>>>>> metric is likely to be 100%, and the input buffer usage might be 0%. >>>>> That >>>>>> means we don't know what is the suitable parallelism to lift the >>>>>> bottleneck. The time-based metric, on the other hand, would give >> useful >>>>>> information, e.g. if 80% of time was spent on emitting records, we can >>>>>> roughly increase the downstream node parallelism by 4 times. >>>>>> >>>>>> Admittedly, the time-based metrics are more expensive than buffer >>>>> usage. So >>>>>> we may have to do some sampling to reduce the cost. But in general, >>>>> using >>>>>> time-based metrics seems worth adding. >>>>>> >>>>>> That being said, I don't think buffer usage metric and time-based >>>>> metrics >>>>>> are mutually exclusive. We can probably have both. It is just that in >>>>>> practice, features like auto-scaling might prefer time-based metrics >> for >>>>>> the reason stated above. >>>>>> >>>>>>> 1. Define the metrics that would allow us to manually detect >>>>> bottlenecks. >>>>>> As I wrote, we already have them in most of the places, except of >>>>>> sources/sinks. >>>>>>> 2. Use those metrics, to automatically detect bottlenecks. Currently >> we >>>>>> are only automatically detecting back pressure and reporting it to the >>>>> user >>>>>> in web UI (is it exposed as a metric at all?). Detecting the root >> cause >>>>> of >>>>>> the back pressure (bottleneck) is one step further. >>>>>>> 3. Use the knowledge about where exactly the bottleneck is located, >> to >>>>>> try to do something with it. >>>>>> >>>>>> As explained above, I think time-based metric also addresses item 1 >> and >>>>>> item 2. >>>>>> >>>>>> Any thoughts? >>>>>> >>>>>> Thanks, >>>>>> >>>>>> Jiangjie (Becket) Qin >>>>>> >>>>>> >>>>>> >>>>>> On Mon, Jun 3, 2019 at 4:14 PM Piotr Nowojski <pi...@ververica.com> >>>>> wrote: >>>>>> >>>>>>> Hi again :) >>>>>>> >>>>>>>> - pending.bytes, Gauge >>>>>>>> - pending.messages, Gauge >>>>>>> >>>>>>> >>>>>>> +1 >>>>>>> >>>>>>> And true, instead of overloading one of the metric it is better when >>>>> user >>>>>>> can choose to provide only one of them. >>>>>>> >>>>>>> Re 2: >>>>>>> >>>>>>>> If I understand correctly, this metric along with the pending >> mesages >>>>> / >>>>>>>> bytes would answer the questions of: >>>>>>> >>>>>>>> - Does the connector consume fast enough? Lagging behind + empty >>>>> buffer >>>>>>> = >>>>>>>> cannot consume fast enough. >>>>>>>> - Does the connector emit fast enough? Lagging behind + full buffer >> = >>>>>>>> cannot emit fast enough, i.e. the Flink pipeline is slow. >>>>>>> >>>>>>> Yes, exactly. This can also be used to support decisions like >> changing >>>>> the >>>>>>> parallelism of the sources and/or down stream operators. >>>>>>> >>>>>>> I’m not sure if I understand your proposal with time based >>>>> measurements. >>>>>>> Maybe I’m missing something, but I do not see how measuring time >> alone >>>>>>> could answer the problem: where is the bottleneck. 
Time spent on the >>>>>>> next/emit might be short or long (depending on how heavy to process >> the >>>>>>> record is) and the source can still be bottlenecked/back pressured or >>>>> not. >>>>>>> Usually the easiest and the most reliable way how to detect >>>>> bottlenecks is >>>>>>> by checking usage of input & output buffers, since when input buffer >> is >>>>>>> full while output buffer is empty, that’s the definition of a >>>>> bottleneck. >>>>>>> Also this is usually very easy and cheap to measure (it works >>>>> effectively >>>>>>> the same way as current’s Flink back pressure monitoring, but more >>>>> cleanly, >>>>>>> without probing thread’s stack traces). >>>>>>> >>>>>>> Also keep in mind that we are already using the buffer usage metrics >>>>> for >>>>>>> detecting the bottlenecks in Flink’s internal network exchanges >> (manual >>>>>>> work). That’s the reason why I wanted to extend this to >> sources/sinks, >>>>>>> since they are currently our blind spot. >>>>>>> >>>>>>>> One feature we are currently working on to scale Flink automatically >>>>>>> relies >>>>>>>> on some metrics answering the same question >>>>>>> >>>>>>> That would be very helpful feature. I think in order to achieve that >> we >>>>>>> would need to: >>>>>>> 1. Define the metrics that would allow us to manually detect >>>>> bottlenecks. >>>>>>> As I wrote, we already have them in most of the places, except of >>>>>>> sources/sinks. >>>>>>> 2. Use those metrics, to automatically detect bottlenecks. Currently >> we >>>>>>> are only automatically detecting back pressure and reporting it to >> the >>>>> user >>>>>>> in web UI (is it exposed as a metric at all?). Detecting the root >>>>> cause of >>>>>>> the back pressure (bottleneck) is one step further. >>>>>>> 3. Use the knowledge about where exactly the bottleneck is located, >> to >>>>> try >>>>>>> to do something with it. >>>>>>> >>>>>>> I think you are aiming for point 3., but before we reach it, we are >>>>> still >>>>>>> missing 1. & 2. Also even if we have 3., there is a value in 1 & 2 >> for >>>>>>> manual analysis/dashboards. >>>>>>> >>>>>>> However, having the knowledge of where the bottleneck is, doesn’t >>>>>>> necessarily mean that we know what we can do about it. For example >>>>>>> increasing parallelism might or might not help with anything (data >>>>> skew, >>>>>>> bottleneck on some resource that does not scale), but this remark >>>>> applies >>>>>>> always, regardless of the way how did we detect the bottleneck. >>>>>>> >>>>>>> Piotrek >>>>>>> >>>>>>>> On 3 Jun 2019, at 06:16, Becket Qin <becket....@gmail.com> wrote: >>>>>>>> >>>>>>>> Hi Piotr, >>>>>>>> >>>>>>>> Thanks for the suggestion. Some thoughts below: >>>>>>>> >>>>>>>> Re 1: The pending messages / bytes. >>>>>>>> I completely agree these are very useful metrics and we should >> expect >>>>> the >>>>>>>> connector to report. WRT the way to expose them, it seems more >>>>> consistent >>>>>>>> to add two metrics instead of adding a method (unless there are >> other >>>>> use >>>>>>>> cases other than metric reporting). So we can add the following two >>>>>>> metrics. >>>>>>>> - pending.bytes, Gauge >>>>>>>> - pending.messages, Gauge >>>>>>>> Applicable connectors can choose to report them. These two metrics >>>>> along >>>>>>>> with latency should be sufficient for users to understand the >> progress >>>>>>> of a >>>>>>>> connector. 
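
A minimal sketch (not part of the FLIP; the class name and the way the backlog counts are tracked are assumptions made up for this example) of how a source connector could expose the two proposed gauges through Flink's MetricGroup:

import org.apache.flink.configuration.Configuration;
import org.apache.flink.metrics.Gauge;
import org.apache.flink.streaming.api.functions.source.RichParallelSourceFunction;

// Sketch only: a source that reports the proposed backlog gauges.
// How pendingBytes / pendingRecords are maintained is connector specific;
// the volatile fields below are placeholders for illustration.
public class BacklogReportingSource extends RichParallelSourceFunction<String> {

    private transient volatile long pendingBytes;
    private transient volatile long pendingRecords;

    @Override
    public void open(Configuration parameters) {
        getRuntimeContext().getMetricGroup()
            .gauge("pendingBytes", (Gauge<Long>) () -> pendingBytes);
        getRuntimeContext().getMetricGroup()
            .gauge("pendingRecords", (Gauge<Long>) () -> pendingRecords);
    }

    @Override
    public void run(SourceContext<String> ctx) throws Exception {
        // fetch loop: emit records and update pendingBytes / pendingRecords
        // from the external system's backlog information
    }

    @Override
    public void cancel() {
    }
}

Since gauges are only read when a reporter polls them, keeping the backlog in two volatile fields keeps the per-record cost negligible.
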
>>>>>>>> >>>>>>>> >>>>>>>> Re 2: Number of buffered data in-memory of the connector >>>>>>>> If I understand correctly, this metric along with the pending >> mesages >>>>> / >>>>>>>> bytes would answer the questions of: >>>>>>>> - Does the connector consume fast enough? Lagging behind + empty >>>>> buffer >>>>>>> = >>>>>>>> cannot consume fast enough. >>>>>>>> - Does the connector emit fast enough? Lagging behind + full buffer >> = >>>>>>>> cannot emit fast enough, i.e. the Flink pipeline is slow. >>>>>>>> >>>>>>>> One feature we are currently working on to scale Flink automatically >>>>>>> relies >>>>>>>> on some metrics answering the same question, more specifically, we >> are >>>>>>>> profiling the time spent on .next() method (time to consume) and the >>>>> time >>>>>>>> spent on .collect() method (time to emit / process). One advantage >> of >>>>>>> such >>>>>>>> method level time cost allows us to calculate the parallelism >>>>> required to >>>>>>>> keep up in case their is a lag. >>>>>>>> >>>>>>>> However, one concern I have regarding such metric is that they are >>>>>>>> implementation specific. Either profiling on the method time, or >>>>>>> reporting >>>>>>>> buffer usage assumes the connector are implemented in such a way. A >>>>>>>> slightly better solution might be have the following metric: >>>>>>>> >>>>>>>> - EmitTimeRatio (or FetchTimeRatio): The time spent on emitting >>>>>>>> records / Total time elapsed. >>>>>>>> >>>>>>>> This assumes that the source connectors have to emit the records to >>>>> the >>>>>>>> downstream at some point. The emission may take some time ( e.g. go >>>>>>> through >>>>>>>> chained operators). And the rest of the time are spent to prepare >> the >>>>>>>> record to emit, including time for consuming and format conversion, >>>>> etc. >>>>>>>> Ideally, we'd like to see the time spent on record fetch and emit to >>>>> be >>>>>>>> about the same, so no one is bottleneck for the other. >>>>>>>> >>>>>>>> The downside of these time based metrics is additional overhead to >> get >>>>>>> the >>>>>>>> time, therefore sampling might be needed. But in practice I feel >> such >>>>>>> time >>>>>>>> based metric might be more useful if we want to take action. >>>>>>>> >>>>>>>> >>>>>>>> I think we should absolutely add metrics in (1) to the metric >>>>> convention. >>>>>>>> We could also add the metrics mentioned in (2) if we reach consensus >>>>> on >>>>>>>> that. What do you think? >>>>>>>> >>>>>>>> Thanks, >>>>>>>> >>>>>>>> Jiangjie (Becket) Qin >>>>>>>> >>>>>>>> >>>>>>>> On Fri, May 31, 2019 at 4:26 PM Piotr Nowojski <pi...@ververica.com >>> >>>>>>> wrote: >>>>>>>> >>>>>>>>> Hey Becket, >>>>>>>>> >>>>>>>>> Re 1a) and 1b) +1 from my side. >>>>>>>>> >>>>>>>>> I’ve discussed this issue: >>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> 2. It would be nice to have metrics, that allow us to check the >>>>> cause >>>>>>>>> of >>>>>>>>>>>> back pressure: >>>>>>>>>>>> a) for sources, length of input queue (in bytes? Or boolean >>>>>>>>>>>> hasSomethingl/isEmpty) >>>>>>>>>>>> b) for sinks, length of output queue (in bytes? Or boolean >>>>>>>>>>>> hasSomething/isEmpty) >>>>>>>>> >>>>>>>>> With Nico at some lengths and he also saw the benefits of them. We >>>>> also >>>>>>>>> have more concrete proposal for that. >>>>>>>>> >>>>>>>>> Actually there are two really useful metrics, that we are missing >>>>>>>>> currently: >>>>>>>>> >>>>>>>>> 1. Number of data/records/bytes in the backlog to process. For >>>>> example >>>>>>>>> remaining number of bytes in unread files. 
Or pending data in Kafka >>>>>>> topics. >>>>>>>>> 2. Number of buffered data in-memory of the connector, that are >>>>> waiting >>>>>>> to >>>>>>>>> be processed pushed to Flink pipeline. >>>>>>>>> >>>>>>>>> Re 1: >>>>>>>>> This would have to be a metric provided directly by a connector. It >>>>>>> could >>>>>>>>> be an undefined `int`: >>>>>>>>> >>>>>>>>> `int backlog` - estimate of pending work. >>>>>>>>> >>>>>>>>> “Undefined” meaning that it would be up to a connector to decided >>>>>>> whether >>>>>>>>> it’s measured in bytes, records, pending files or whatever it is >>>>>>> possible >>>>>>>>> to provide by the connector. This is because I assume not every >>>>>>> connector >>>>>>>>> can provide exact number and for some of them it might be >> impossible >>>>> to >>>>>>>>> provide records number of bytes count. >>>>>>>>> >>>>>>>>> Re 2: >>>>>>>>> This metric could be either provided by a connector, or calculated >>>>>>> crudely >>>>>>>>> by Flink: >>>>>>>>> >>>>>>>>> `float bufferUsage` - value from [0.0, 1.0] range >>>>>>>>> >>>>>>>>> Percentage of used in memory buffers, like in Kafka’s handover. >>>>>>>>> >>>>>>>>> It could be crudely implemented by Flink with FLIP-27 >>>>>>>>> SourceReader#isAvailable. If SourceReader is not available reported >>>>>>>>> `bufferUsage` could be 0.0. If it is available, it could be 1.0. I >>>>> think >>>>>>>>> this would be a good enough estimation for most of the use cases >>>>> (that >>>>>>>>> could be overloaded and implemented better if desired). Especially >>>>>>> since we >>>>>>>>> are reporting only probed values: if probed values are almost >> always >>>>>>> “1.0”, >>>>>>>>> it would mean that we have a back pressure. If they are almost >> always >>>>>>>>> “0.0”, there is probably no back pressure at the sources. >>>>>>>>> >>>>>>>>> What do you think about this? >>>>>>>>> >>>>>>>>> Piotrek >>>>>>>>> >>>>>>>>>> On 30 May 2019, at 11:41, Becket Qin <becket....@gmail.com> >> wrote: >>>>>>>>>> >>>>>>>>>> Hi all, >>>>>>>>>> >>>>>>>>>> Thanks a lot for all the feedback and comments. I'd like to >> continue >>>>>>> the >>>>>>>>>> discussion on this FLIP. >>>>>>>>>> >>>>>>>>>> I updated the FLIP-33 wiki to remove all the histogram metrics >> from >>>>> the >>>>>>>>>> first version of metric convention due to the performance concern. >>>>> The >>>>>>>>> plan >>>>>>>>>> is to introduce them later when we have a mechanism to opt in/out >>>>>>>>> metrics. >>>>>>>>>> At that point, users can decide whether they want to pay the cost >> to >>>>>>> get >>>>>>>>>> the metric or not. >>>>>>>>>> >>>>>>>>>> As Stephan suggested, for this FLIP, let's first try to agree on >> the >>>>>>>>> small >>>>>>>>>> list of conventional metrics that connectors should follow. >>>>>>>>>> Just to be clear, the purpose of the convention is not to enforce >>>>> every >>>>>>>>>> connector to report all these metrics, but to provide a guidance >> in >>>>>>> case >>>>>>>>>> these metrics are reported by some connectors. >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> @ Stephan & Chesnay, >>>>>>>>>> >>>>>>>>>> Regarding the duplication of `RecordsIn` metric in operator / task >>>>>>>>>> IOMetricGroups, from what I understand, for source operator, it is >>>>>>>>> actually >>>>>>>>>> the SourceFunction that reports the operator level >>>>>>>>>> RecordsIn/RecordsInPerSecond metric. So they are essentially the >>>>> same >>>>>>>>>> metric in the operator level IOMetricGroup. 
Similarly for the Sink >>>>>>>>>> operator, the operator level RecordsOut/RecordsOutPerSecond >> metrics >>>>> are >>>>>>>>>> also reported by the Sink function. I marked them as existing in >> the >>>>>>>>>> FLIP-33 wiki page. Please let me know if I misunderstood. >>>>>>>>>> >>>>>>>>>> Thanks, >>>>>>>>>> >>>>>>>>>> Jiangjie (Becket) Qin >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> On Thu, May 30, 2019 at 5:16 PM Becket Qin <becket....@gmail.com> >>>>>>> wrote: >>>>>>>>>> >>>>>>>>>>> Hi Piotr, >>>>>>>>>>> >>>>>>>>>>> Thanks a lot for the feedback. >>>>>>>>>>> >>>>>>>>>>> 1a) I guess you are referring to the part that "original system >>>>>>> specific >>>>>>>>>>> metrics should also be reported". The performance impact depends >> on >>>>>>> the >>>>>>>>>>> implementation. An efficient implementation would only record the >>>>>>> metric >>>>>>>>>>> once, but report them with two different metric names. This is >>>>>>> unlikely >>>>>>>>> to >>>>>>>>>>> hurt performance. >>>>>>>>>>> >>>>>>>>>>> 1b) Yes, I agree that we should avoid adding overhead to the >>>>> critical >>>>>>>>> path >>>>>>>>>>> by all means. This is sometimes a tradeoff, running blindly >> without >>>>>>> any >>>>>>>>>>> metric gives best performance, but sometimes might be frustrating >>>>> when >>>>>>>>> we >>>>>>>>>>> debug some issues. >>>>>>>>>>> >>>>>>>>>>> 2. The metrics are indeed very useful. Are they supposed to be >>>>>>> reported >>>>>>>>> by >>>>>>>>>>> the connectors or Flink itself? At this point FLIP-33 is more >>>>> focused >>>>>>> on >>>>>>>>>>> provide a guidance to the connector authors on the metrics >>>>> reporting. >>>>>>>>> That >>>>>>>>>>> said, after FLIP-27, I think we should absolutely report these >>>>> metrics >>>>>>>>> in >>>>>>>>>>> the abstract implementation. In any case, the metric convention >> in >>>>>>> this >>>>>>>>>>> list are expected to evolve over time. >>>>>>>>>>> >>>>>>>>>>> Thanks, >>>>>>>>>>> >>>>>>>>>>> Jiangjie (Becket) Qin >>>>>>>>>>> >>>>>>>>>>> On Tue, May 28, 2019 at 6:24 PM Piotr Nowojski < >>>>> pi...@ververica.com> >>>>>>>>>>> wrote: >>>>>>>>>>> >>>>>>>>>>>> Hi, >>>>>>>>>>>> >>>>>>>>>>>> Thanks for the proposal and driving the effort here Becket :) >> I’ve >>>>>>> read >>>>>>>>>>>> through the FLIP-33 [1], and here are couple of my thoughts. >>>>>>>>>>>> >>>>>>>>>>>> Big +1 for standardising the metric names between connectors, it >>>>> will >>>>>>>>>>>> definitely help us and users a lot. >>>>>>>>>>>> >>>>>>>>>>>> Issues/questions/things to discuss that I’ve thought of: >>>>>>>>>>>> >>>>>>>>>>>> 1a. If we are about to duplicate some metrics, can this become a >>>>>>>>>>>> performance issue? >>>>>>>>>>>> 1b. Generally speaking, we should make sure that collecting >> those >>>>>>>>> metrics >>>>>>>>>>>> is as non intrusive as possible, especially that they will need >>>>> to be >>>>>>>>>>>> updated once per record. (They might be collected more rarely >> with >>>>>>> some >>>>>>>>>>>> overhead, but the hot path of updating it per record will need >> to >>>>> be >>>>>>> as >>>>>>>>>>>> quick as possible). That includes both avoiding heavy >> computation >>>>> on >>>>>>>>> per >>>>>>>>>>>> record path: histograms?, measuring time for time based metrics >>>>> (per >>>>>>>>>>>> second) (System.currentTimeMillis() depending on the >>>>> implementation >>>>>>> can >>>>>>>>>>>> invoke a system call) >>>>>>>>>>>> >>>>>>>>>>>> 2. 
It would be nice to have metrics, that allow us to check the >>>>> cause >>>>>>>>> of >>>>>>>>>>>> back pressure: >>>>>>>>>>>> a) for sources, length of input queue (in bytes? Or boolean >>>>>>>>>>>> hasSomethingl/isEmpty) >>>>>>>>>>>> b) for sinks, length of output queue (in bytes? Or boolean >>>>>>>>>>>> hasSomething/isEmpty) >>>>>>>>>>>> >>>>>>>>>>>> a) is useful in a scenario when we are processing backlog of >>>>> records, >>>>>>>>> all >>>>>>>>>>>> of the internal Flink’s input/output network buffers are empty, >>>>> and >>>>>>> we >>>>>>>>> want >>>>>>>>>>>> to check whether the external source system is the bottleneck >>>>>>> (source’s >>>>>>>>>>>> input queue will be empty), or if the Flink’s connector is the >>>>>>>>> bottleneck >>>>>>>>>>>> (source’s input queues will be full). >>>>>>>>>>>> b) similar story. Backlog of records, but this time all of the >>>>>>> internal >>>>>>>>>>>> Flink’s input/ouput network buffers are full, and we want o >> check >>>>>>>>> whether >>>>>>>>>>>> the external sink system is the bottleneck (sink output queues >> are >>>>>>>>> full), >>>>>>>>>>>> or if the Flink’s connector is the bottleneck (sink’s output >>>>> queues >>>>>>>>> will be >>>>>>>>>>>> empty) >>>>>>>>>>>> >>>>>>>>>>>> It might be sometimes difficult to provide those metrics, so >> they >>>>>>> could >>>>>>>>>>>> be optional, but if we could provide them, it would be really >>>>>>> helpful. >>>>>>>>>>>> >>>>>>>>>>>> Piotrek >>>>>>>>>>>> >>>>>>>>>>>> [1] >>>>>>>>>>>> >>>>>>>>> >>>>>>> >>>>> >> https://cwiki.apache.org/confluence/display/FLINK/FLIP-33:+Standardize+Connector+Metrics >>>>>>>>>>>> < >>>>>>>>>>>> >>>>>>>>> >>>>>>> >>>>> >> https://cwiki.apache.org/confluence/display/FLINK/FLIP-33:+Standardize+Connector+Metrics >>>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>>> On 24 Apr 2019, at 13:28, Stephan Ewen <se...@apache.org> >> wrote: >>>>>>>>>>>>> >>>>>>>>>>>>> I think this sounds reasonable. >>>>>>>>>>>>> >>>>>>>>>>>>> Let's keep the "reconfiguration without stopping the job" out >> of >>>>>>> this, >>>>>>>>>>>>> because that would be a super big effort and if we approach >> that, >>>>>>> then >>>>>>>>>>>> in >>>>>>>>>>>>> more generic way rather than specific to connector metrics. >>>>>>>>>>>>> >>>>>>>>>>>>> I would suggest to look at the following things before starting >>>>> with >>>>>>>>> any >>>>>>>>>>>>> implementation work: >>>>>>>>>>>>> >>>>>>>>>>>>> - Try and find a committer to support this, otherwise it will >> be >>>>>>> hard >>>>>>>>>>>> to >>>>>>>>>>>>> make progress >>>>>>>>>>>>> - Start with defining a smaller set of "core metrics" and >> extend >>>>> the >>>>>>>>>>>> set >>>>>>>>>>>>> later. I think that is easier than now blocking on reaching >>>>>>> consensus >>>>>>>>>>>> on a >>>>>>>>>>>>> large group of metrics. >>>>>>>>>>>>> - Find a solution to the problem Chesnay mentioned, that the >>>>>>> "records >>>>>>>>>>>> in" >>>>>>>>>>>>> metric is somehow overloaded and exists already in the IO >> Metric >>>>>>>>> group. >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> On Mon, Mar 25, 2019 at 7:16 AM Becket Qin < >> becket....@gmail.com >>>>>> >>>>>>>>>>>> wrote: >>>>>>>>>>>>> >>>>>>>>>>>>>> Hi Stephan, >>>>>>>>>>>>>> >>>>>>>>>>>>>> Thanks a lot for the feedback. All makes sense. >>>>>>>>>>>>>> >>>>>>>>>>>>>> It is a good suggestion to simply have an onRecord(numBytes, >>>>>>>>> eventTime) >>>>>>>>>>>>>> method for connector writers. 
It should meet most of the >>>>>>>>> requirements, >>>>>>>>>>>>>> individual >>>>>>>>>>>>>> >>>>>>>>>>>>>> The configurable metrics feature is something really useful, >>>>>>>>>>>> especially if >>>>>>>>>>>>>> we can somehow make it dynamically configurable without >> stopping >>>>>>> the >>>>>>>>>>>> jobs. >>>>>>>>>>>>>> It might be better to make it a separate discussion because it >>>>> is a >>>>>>>>>>>> more >>>>>>>>>>>>>> generic feature instead of only for connectors. >>>>>>>>>>>>>> >>>>>>>>>>>>>> So in order to make some progress, in this FLIP we can limit >> the >>>>>>>>>>>> discussion >>>>>>>>>>>>>> scope to the connector related items: >>>>>>>>>>>>>> >>>>>>>>>>>>>> - the standard connector metric names and types. >>>>>>>>>>>>>> - the abstract ConnectorMetricHandler interface >>>>>>>>>>>>>> >>>>>>>>>>>>>> I'll start a separate thread to discuss other general metric >>>>>>> related >>>>>>>>>>>>>> enhancement items including: >>>>>>>>>>>>>> >>>>>>>>>>>>>> - optional metrics >>>>>>>>>>>>>> - dynamic metric configuration >>>>>>>>>>>>>> - potential combination with rate limiter >>>>>>>>>>>>>> >>>>>>>>>>>>>> Does this plan sound reasonable? >>>>>>>>>>>>>> >>>>>>>>>>>>>> Thanks, >>>>>>>>>>>>>> >>>>>>>>>>>>>> Jiangjie (Becket) Qin >>>>>>>>>>>>>> >>>>>>>>>>>>>> On Sat, Mar 23, 2019 at 5:53 AM Stephan Ewen < >> se...@apache.org> >>>>>>>>> wrote: >>>>>>>>>>>>>> >>>>>>>>>>>>>>> Ignoring for a moment implementation details, this connector >>>>>>> metrics >>>>>>>>>>>> work >>>>>>>>>>>>>>> is a really good thing to do, in my opinion >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> The questions "oh, my job seems to be doing nothing, I am >>>>> looking >>>>>>> at >>>>>>>>>>>> the >>>>>>>>>>>>>> UI >>>>>>>>>>>>>>> and the 'records in' value is still zero" is in the top three >>>>>>>>> support >>>>>>>>>>>>>>> questions I have been asked personally. >>>>>>>>>>>>>>> Introspection into "how far is the consumer lagging behind" >>>>> (event >>>>>>>>>>>> time >>>>>>>>>>>>>>> fetch latency) came up many times as well. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> So big +1 to solving this problem. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> About the exact design - I would try to go for the following >>>>>>>>>>>> properties: >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> - keep complexity of of connectors. Ideally the metrics >> handler >>>>>>> has >>>>>>>>> a >>>>>>>>>>>>>>> single onRecord(numBytes, eventTime) method or so, and >>>>> everything >>>>>>>>>>>> else is >>>>>>>>>>>>>>> internal to the handler. That makes it dead simple for the >>>>>>>>> connector. >>>>>>>>>>>> We >>>>>>>>>>>>>>> can also think of an extensive scheme for connector specific >>>>>>>>> metrics. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> - make it configurable on the job it cluster level which >>>>> metrics >>>>>>> the >>>>>>>>>>>>>>> handler internally creates when that method is invoked. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> What do you think? >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Best, >>>>>>>>>>>>>>> Stephan >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> On Thu, Mar 21, 2019 at 10:42 AM Chesnay Schepler < >>>>>>>>> ches...@apache.org >>>>>>>>>>>>> >>>>>>>>>>>>>>> wrote: >>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> As I said before, I believe this to be over-engineered and >>>>> have >>>>>>> no >>>>>>>>>>>>>>>> interest in this implementation. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> There are conceptual issues like defining a duplicate >>>>>>>>>>>>>> numBytesIn(PerSec) >>>>>>>>>>>>>>>> metric that already exists for each operator. 
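
To make the single-entry-point idea concrete, a rough sketch of what such a handler could look like — the names and the exact set of metrics are assumptions for illustration, not the actual ConnectorMetricHandler patch:

import org.apache.flink.metrics.Counter;
import org.apache.flink.metrics.Meter;
import org.apache.flink.metrics.MeterView;
import org.apache.flink.metrics.MetricGroup;

// Sketch only: the connector calls onRecord(...) once per record and the
// handler decides internally which metrics to maintain. Whether counters
// like numBytesIn should live here or be reused from the operator's
// IOMetricGroup is exactly the duplication question raised above.
public class ConnectorMetricHandler {

    private final Counter numRecordsIn;
    private final Counter numBytesIn;
    private final Meter numRecordsInPerSecond;

    public ConnectorMetricHandler(MetricGroup connectorGroup) {
        this.numRecordsIn = connectorGroup.counter("numRecordsIn");
        this.numBytesIn = connectorGroup.counter("numBytesIn");
        this.numRecordsInPerSecond =
            connectorGroup.meter("numRecordsInPerSecond", new MeterView(numRecordsIn, 60));
    }

    // Hot path: called once per record, so it only increments counters.
    public void onRecord(long numBytes, long eventTime) {
        numRecordsIn.inc();
        numBytesIn.inc(numBytes);
        // eventTime could feed an (optional) fetch-latency metric if enabled
    }
}

Which of the internal metrics actually get registered when onRecord() is invoked could then be driven by job- or cluster-level configuration, as suggested above.
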
>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> On 21.03.2019 06:13, Becket Qin wrote: >>>>>>>>>>>>>>>>> A few updates to the thread. I uploaded a patch[1] as a >>>>> complete >>>>>>>>>>>>>>>>> example of how users can use the metrics in the future. >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> Some thoughts below after taking a look at the >>>>>>> AbstractMetricGroup >>>>>>>>>>>>>> and >>>>>>>>>>>>>>>>> its subclasses. >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> This patch intends to provide convenience for Flink >> connector >>>>>>>>>>>>>>>>> implementations to follow metrics standards proposed in >>>>> FLIP-33. >>>>>>>>> It >>>>>>>>>>>>>>>>> also try to enhance the metric management in general way to >>>>> help >>>>>>>>>>>>>> users >>>>>>>>>>>>>>>>> with: >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> 1. metric definition >>>>>>>>>>>>>>>>> 2. metric dependencies check >>>>>>>>>>>>>>>>> 3. metric validation >>>>>>>>>>>>>>>>> 4. metric control (turn on / off particular metrics) >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> This patch wraps |MetricGroup| to extend the functionality >> of >>>>>>>>>>>>>>>>> |AbstractMetricGroup| and its subclasses. The >>>>>>>>>>>>>>>>> |AbstractMetricGroup| mainly focus on the metric group >>>>>>> hierarchy, >>>>>>>>>>>> but >>>>>>>>>>>>>>>>> does not really manage the metrics other than keeping them >>>>> in a >>>>>>>>> Map. >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> Ideally we should only have one entry point for the >> metrics. >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> Right now the entry point is |AbstractMetricGroup|. >> However, >>>>>>>>> besides >>>>>>>>>>>>>>>>> the missing functionality mentioned above, >>>>> |AbstractMetricGroup| >>>>>>>>>>>>>> seems >>>>>>>>>>>>>>>>> deeply rooted in Flink runtime. We could extract it out to >>>>>>>>>>>>>>>>> flink-metrics in order to use it for generic purpose. There >>>>> will >>>>>>>>> be >>>>>>>>>>>>>>>>> some work, though. >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> Another approach is to make |AbstractMetrics| in this patch >>>>> as >>>>>>> the >>>>>>>>>>>>>>>>> metric entry point. It wraps metric group and provides the >>>>>>> missing >>>>>>>>>>>>>>>>> functionalities. Then we can roll out this pattern to >> runtime >>>>>>>>>>>>>>>>> components gradually as well. >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> My first thought is that the latter approach gives a more >>>>> smooth >>>>>>>>>>>>>>>>> migration. But I am also OK with doing a refactoring on the >>>>>>>>>>>>>>>>> |AbstractMetricGroup| family. >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> Thanks, >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> Jiangjie (Becket) Qin >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> [1] https://github.com/becketqin/flink/pull/1 >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> On Mon, Feb 25, 2019 at 2:32 PM Becket Qin < >>>>>>> becket....@gmail.com >>>>>>>>>>>>>>>>> <mailto:becket....@gmail.com>> wrote: >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> Hi Chesnay, >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> It might be easier to discuss some implementation details >> in >>>>>>> the >>>>>>>>>>>>>>>>> PR review instead of in the FLIP discussion thread. I have >> a >>>>>>>>>>>>>> patch >>>>>>>>>>>>>>>>> for Kafka connectors ready but haven't submitted the PR >> yet. >>>>>>>>>>>>>>>>> Hopefully that will help explain a bit more. >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> ** Re: metric type binding >>>>>>>>>>>>>>>>> This is a valid point that worths discussing. If I >> understand >>>>>>>>>>>>>>>>> correctly, there are two points: >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> 1. 
Metric type / interface does not matter as long as the >>>>>>> metric >>>>>>>>>>>>>>>>> semantic is clearly defined. >>>>>>>>>>>>>>>>> Conceptually speaking, I agree that as long as the metric >>>>>>>>>>>>>> semantic >>>>>>>>>>>>>>>>> is defined, metric type does not matter. To some extent, >>>>> Gauge >>>>>>> / >>>>>>>>>>>>>>>>> Counter / Meter / Histogram themselves can be think of as >>>>> some >>>>>>>>>>>>>>>>> well-recognized semantics, if you wish. In Flink, these >>>>> metric >>>>>>>>>>>>>>>>> semantics have their associated interface classes. In >>>>> practice, >>>>>>>>>>>>>>>>> such semantic to interface binding seems necessary for >>>>>>> different >>>>>>>>>>>>>>>>> components to communicate. Simply standardize the semantic >>>>> of >>>>>>>>>>>>>> the >>>>>>>>>>>>>>>>> connector metrics seems not sufficient for people to build >>>>>>>>>>>>>>>>> ecosystem on top of. At the end of the day, we still need >> to >>>>>>>>> have >>>>>>>>>>>>>>>>> some embodiment of the metric semantics that people can >>>>> program >>>>>>>>>>>>>>>>> against. >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> 2. Sometimes the same metric semantic can be exposed using >>>>>>>>>>>>>>>>> different metric types / interfaces. >>>>>>>>>>>>>>>>> This is a good point. Counter and Gauge-as-a-Counter are >>>>> pretty >>>>>>>>>>>>>>>>> much interchangeable. This is more of a trade-off between >> the >>>>>>>>>>>>>> user >>>>>>>>>>>>>>>>> experience of metric producers and consumers. The metric >>>>>>>>>>>>>> producers >>>>>>>>>>>>>>>>> want to use Counter or Gauge depending on whether the >> counter >>>>>>> is >>>>>>>>>>>>>>>>> already tracked in code, while ideally the metric consumers >>>>>>> only >>>>>>>>>>>>>>>>> want to see a single metric type for each metric. I am >>>>> leaning >>>>>>>>>>>>>>>>> towards to make the metric producers happy, i.e. allow >> Gauge >>>>> / >>>>>>>>>>>>>>>>> Counter metric type, and the the metric consumers handle >> the >>>>>>>>> type >>>>>>>>>>>>>>>>> variation. The reason is that in practice, there might be >>>>> more >>>>>>>>>>>>>>>>> connector implementations than metric reporter >>>>> implementations. >>>>>>>>>>>>>> We >>>>>>>>>>>>>>>>> could also provide some helper method to facilitate reading >>>>>>> from >>>>>>>>>>>>>>>>> such variable metric type. >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> Just some quick replies to the comments around >> implementation >>>>>>>>>>>>>>>> details. >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> 4) single place where metrics are registered except >>>>>>>>>>>>>>>>> connector-specific >>>>>>>>>>>>>>>>> ones (which we can't really avoid). >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> Register connector specific ones in a single place is >>>>> actually >>>>>>>>>>>>>>>>> something that I want to achieve. >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> 2) I'm talking about time-series databases like >>>>> Prometheus. >>>>>>>>>>>>>> We >>>>>>>>>>>>>>>>> would >>>>>>>>>>>>>>>>> only have a gauge metric exposing the last >>>>>>>>> fetchTime/emitTime >>>>>>>>>>>>>>>>> that is >>>>>>>>>>>>>>>>> regularly reported to the backend (Prometheus), where a >>>>>>> user >>>>>>>>>>>>>>>>> could build >>>>>>>>>>>>>>>>> a histogram of his choosing when/if he wants it. >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> Not sure if such downsampling works. As an example, if a >> user >>>>>>>>>>>>>>>>> complains that there are some intermittent latency spikes >>>>>>> (maybe >>>>>>>>>>>>>> a >>>>>>>>>>>>>>>>> few records in 10 seconds) in their processing system. 
>>>>> Having a >>>>>>>>>>>>>>>>> Gauge sampling instantaneous latency seems unlikely useful. >>>>>>>>>>>>>>>>> However by looking at actual 99.9 percentile latency might >>>>>>> help. >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> Thanks, >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> Jiangjie (Becket) Qin >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> On Fri, Feb 22, 2019 at 9:30 PM Chesnay Schepler >>>>>>>>>>>>>>>>> <ches...@apache.org <mailto:ches...@apache.org>> wrote: >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> Re: over complication of implementation. >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> I think I get understand better know what you're >> shooting >>>>>>>>>>>>>> for, >>>>>>>>>>>>>>>>> effectively something like the OperatorIOMetricGroup. >>>>>>>>>>>>>>>>> But still, re-define setupConnectorMetrics() to accept a >>>>>>> set >>>>>>>>>>>>>>>>> of flags >>>>>>>>>>>>>>>>> for counters/meters(ans _possibly_ histograms) along >>>>> with a >>>>>>>>>>>>>>>>> set of >>>>>>>>>>>>>>>>> well-defined Optional<Gauge<?>>, and return the group. >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> Solves all issues as far as i can tell: >>>>>>>>>>>>>>>>> 1) no metrics must be created manually (except Gauges, >>>>>>> which >>>>>>>>>>>>>>> are >>>>>>>>>>>>>>>>> effectively just Suppliers and you can't get around >>>>> this), >>>>>>>>>>>>>>>>> 2) additional metrics can be registered on the returned >>>>>>>>>>>>>> group, >>>>>>>>>>>>>>>>> 3) see 1), >>>>>>>>>>>>>>>>> 4) single place where metrics are registered except >>>>>>>>>>>>>>>>> connector-specific >>>>>>>>>>>>>>>>> ones (which we can't really avoid). >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> Re: Histogram >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> 1) As an example, whether "numRecordsIn" is exposed as a >>>>>>>>>>>>>>>>> Counter or a >>>>>>>>>>>>>>>>> Gauge should be irrelevant. So far we're using the >> metric >>>>>>>>>>>>>> type >>>>>>>>>>>>>>>>> that is >>>>>>>>>>>>>>>>> the most convenient at exposing a given value. If there >>>>> is >>>>>>>>>>>>>>>>> some backing >>>>>>>>>>>>>>>>> data-structure that we want to expose some data from we >>>>>>>>>>>>>>>>> typically opt >>>>>>>>>>>>>>>>> for a Gauge, as otherwise we're just mucking around with >>>>>>> the >>>>>>>>>>>>>>>>> Meter/Counter API to get it to match. Similarly, if we >>>>> want >>>>>>>>>>>>>> to >>>>>>>>>>>>>>>>> count >>>>>>>>>>>>>>>>> something but no current count exists we typically added >>>>> a >>>>>>>>>>>>>>>>> Counter. >>>>>>>>>>>>>>>>> That's why attaching semantics to metric types makes >>>>> little >>>>>>>>>>>>>>>>> sense (but >>>>>>>>>>>>>>>>> unfortunately several reporters already do it); for >>>>>>>>>>>>>>>>> counters/meters >>>>>>>>>>>>>>>>> certainly, but the majority of metrics are gauges. >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> 2) I'm talking about time-series databases like >>>>> Prometheus. >>>>>>>>>>>>>> We >>>>>>>>>>>>>>>>> would >>>>>>>>>>>>>>>>> only have a gauge metric exposing the last >>>>>>>>> fetchTime/emitTime >>>>>>>>>>>>>>>>> that is >>>>>>>>>>>>>>>>> regularly reported to the backend (Prometheus), where a >>>>>>> user >>>>>>>>>>>>>>>>> could build >>>>>>>>>>>>>>>>> a histogram of his choosing when/if he wants it. >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> On 22.02.2019 13:57, Becket Qin wrote: >>>>>>>>>>>>>>>>>> Hi Chesnay, >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> Thanks for the explanation. >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> ** Re: FLIP >>>>>>>>>>>>>>>>>> I might have misunderstood this, but it seems that "major >>>>>>>>>>>>>>>>> changes" are well >>>>>>>>>>>>>>>>>> defined in FLIP. 
The full contents is following: >>>>>>>>>>>>>>>>>> What is considered a "major change" that needs a FLIP? >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> Any of the following should be considered a major change: >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> - Any major new feature, subsystem, or piece of >>>>>>>>>>>>>>>>> functionality >>>>>>>>>>>>>>>>>> - *Any change that impacts the public interfaces of the >>>>>>>>>>>>>>>>> project* >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> What are the "public interfaces" of the project? >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> *All of the following are public interfaces *that people >>>>>>>>>>>>>>>>> build around: >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> - DataStream and DataSet API, including classes related >>>>>>>>>>>>>>>>> to that, such as >>>>>>>>>>>>>>>>>> StreamExecutionEnvironment >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> - Classes marked with the @Public annotation >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> - On-disk binary formats, such as >>>>>>>>>>>>>> checkpoints/savepoints >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> - User-facing scripts/command-line tools, i.e. >>>>>>>>>>>>>>>>> bin/flink, Yarn scripts, >>>>>>>>>>>>>>>>>> Mesos scripts >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> - Configuration settings >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> - *Exposed monitoring information* >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> So any monitoring information change is considered as >>>>>>>>>>>>>> public >>>>>>>>>>>>>>>>> interface, and >>>>>>>>>>>>>>>>>> any public interface change is considered as a "major >>>>>>>>>>>>>>> change". >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> ** Re: over complication of implementation. >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> Although this is more of implementation details that is >> not >>>>>>>>>>>>>>>>> covered by the >>>>>>>>>>>>>>>>>> FLIP. But it may be worth discussing. >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> First of all, I completely agree that we should use the >>>>>>>>>>>>>>>>> simplest way to >>>>>>>>>>>>>>>>>> achieve our goal. To me the goal is the following: >>>>>>>>>>>>>>>>>> 1. Clear connector conventions and interfaces. >>>>>>>>>>>>>>>>>> 2. The easiness of creating a connector. >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> Both of them are important to the prosperity of the >>>>>>>>>>>>>>>>> connector ecosystem. So >>>>>>>>>>>>>>>>>> I'd rather abstract as much as possible on our side to >> make >>>>>>>>>>>>>>>>> the connector >>>>>>>>>>>>>>>>>> developer's work lighter. Given this goal, a static util >>>>>>>>>>>>>>>>> method approach >>>>>>>>>>>>>>>>>> might have a few drawbacks: >>>>>>>>>>>>>>>>>> 1. Users still have to construct the metrics by >> themselves. >>>>>>>>>>>>>>>>> (And note that >>>>>>>>>>>>>>>>>> this might be erroneous by itself. For example, a customer >>>>>>>>>>>>>>>>> wrapper around >>>>>>>>>>>>>>>>>> dropwizard meter maybe used instead of MeterView). >>>>>>>>>>>>>>>>>> 2. When connector specific metrics are added, it is >>>>>>>>>>>>>>>>> difficult to enforce >>>>>>>>>>>>>>>>>> the scope to be the same as standard metrics. >>>>>>>>>>>>>>>>>> 3. It seems that a method proliferation is inevitable if >> we >>>>>>>>>>>>>>>>> want to apply >>>>>>>>>>>>>>>>>> sanity checks. e.g. The metric of numBytesIn was not >>>>>>>>>>>>>>>>> registered for a meter. >>>>>>>>>>>>>>>>>> 4. Metrics are still defined in random places and hard to >>>>>>>>>>>>>>>> track. 
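
For comparison, a minimal sketch (assumed names, not the actual patch) of the AbstractMetrics-style entry point referred to in this thread, where the standard metrics are declared in one place and individual metrics can be switched off:

import java.util.Set;
import org.apache.flink.metrics.Counter;
import org.apache.flink.metrics.Gauge;
import org.apache.flink.metrics.MetricGroup;
import org.apache.flink.metrics.SimpleCounter;

// Sketch only: a single entry point that owns metric definition and
// registration, and honors a set of disabled metric names.
public abstract class AbstractConnectorMetrics {

    private final MetricGroup group;
    private final Set<String> disabledMetrics;

    protected AbstractConnectorMetrics(MetricGroup group, Set<String> disabledMetrics) {
        this.group = group;
        this.disabledMetrics = disabledMetrics;
    }

    // Registers the counter unless it was switched off; an unregistered
    // SimpleCounter is returned otherwise so call sites need no null checks.
    protected Counter counter(String name) {
        return disabledMetrics.contains(name) ? new SimpleCounter() : group.counter(name);
    }

    // Registers the gauge unless it was switched off.
    protected <T> void gauge(String name, Gauge<T> gauge) {
        if (!disabledMetrics.contains(name)) {
            group.gauge(name, gauge);
        }
    }
}
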
>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> The current PR I had was inspired by the Config system in >>>>>>>>>>>>>>>>> Kafka, which I >>>>>>>>>>>>>>>>>> found pretty handy. In fact it is not only used by Kafka >>>>>>>>>>>>>>>>> itself but even >>>>>>>>>>>>>>>>>> some other projects that depend on Kafka. I am not saying >>>>>>>>>>>>>>>>> this approach is >>>>>>>>>>>>>>>>>> perfect. But I think it worths to save the work for >>>>>>>>>>>>>>>>> connector writers and >>>>>>>>>>>>>>>>>> encourage more systematic implementation. That being said, >>>>>>>>>>>>>> I >>>>>>>>>>>>>>>>> am fully open >>>>>>>>>>>>>>>>>> to suggestions. >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> Re: Histogram >>>>>>>>>>>>>>>>>> I think there are two orthogonal questions around those >>>>>>>>>>>>>>>> metrics: >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> 1. Regardless of the metric type, by just looking at the >>>>>>>>>>>>>>>>> meaning of a >>>>>>>>>>>>>>>>>> metric, is generic to all connectors? If the answer is >> yes, >>>>>>>>>>>>>>>>> we should >>>>>>>>>>>>>>>>>> include the metric into the convention. No matter whether >>>>>>>>>>>>>> we >>>>>>>>>>>>>>>>> include it >>>>>>>>>>>>>>>>>> into the convention or not, some connector implementations >>>>>>>>>>>>>>>>> will emit such >>>>>>>>>>>>>>>>>> metric. It is better to have a convention than letting >> each >>>>>>>>>>>>>>>>> connector do >>>>>>>>>>>>>>>>>> random things. >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> 2. If a standard metric is a histogram, what should we do? >>>>>>>>>>>>>>>>>> I agree that we should make it clear that using histograms >>>>>>>>>>>>>>>>> will have >>>>>>>>>>>>>>>>>> performance risk. But I do see histogram is useful in some >>>>>>>>>>>>>>>>> fine-granularity >>>>>>>>>>>>>>>>>> debugging where one do not have the luxury to stop the >>>>>>>>>>>>>>>>> system and inject >>>>>>>>>>>>>>>>>> more inspection code. So the workaround I am thinking is >> to >>>>>>>>>>>>>>>>> provide some >>>>>>>>>>>>>>>>>> implementation suggestions. Assume later on we have a >>>>>>>>>>>>>>>>> mechanism of >>>>>>>>>>>>>>>>>> selective metrics. In the abstract metrics class we can >>>>>>>>>>>>>>>>> disable those >>>>>>>>>>>>>>>>>> metrics by default individual connector writers does not >>>>>>>>>>>>>>>>> have to do >>>>>>>>>>>>>>>>>> anything (this is another advantage of having an >>>>>>>>>>>>>>>>> AbstractMetrics instead of >>>>>>>>>>>>>>>>>> static util methods.) >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> I am not sure I fully understand the histogram in the >>>>>>>>>>>>>>>>> backend approach. Can >>>>>>>>>>>>>>>>>> you explain a bit more? Do you mean emitting the raw data, >>>>>>>>>>>>>>>>> e.g. fetchTime >>>>>>>>>>>>>>>>>> and emitTime with each record and let the histogram >>>>>>>>>>>>>>>>> computation happen in >>>>>>>>>>>>>>>>>> the background? Or let the processing thread putting the >>>>>>>>>>>>>>>>> values into a >>>>>>>>>>>>>>>>>> queue and have a separate thread polling from the queue >> and >>>>>>>>>>>>>>>>> add them into >>>>>>>>>>>>>>>>>> the histogram? 
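
For reference, the gauge-based reading of that approach (the one Chesnay spells out in his later reply, quoted further up) could look roughly like this — names are illustrative only:

import org.apache.flink.metrics.Gauge;
import org.apache.flink.metrics.MetricGroup;

// Sketch only: expose the last observed fetch latency as a gauge and let
// the metrics backend (e.g. Prometheus) build distributions/percentiles
// from the regularly reported samples, instead of maintaining a Histogram
// inside the connector.
public class FetchLatencyReporter {

    private volatile long lastFetchLatencyMs;

    public FetchLatencyReporter(MetricGroup group) {
        group.gauge("fetchLatency", (Gauge<Long>) () -> lastFetchLatencyMs);
    }

    // Hot path: a single assignment per record, no histogram bookkeeping.
    public void onRecordFetched(long eventTimeMs) {
        lastFetchLatencyMs = System.currentTimeMillis() - eventTimeMs;
    }
}

The per-record cost then reduces to one assignment, at the price of the backend only seeing whichever value happens to be current when it polls.
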
>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> Thanks, >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> Jiangjie (Becket) Qin >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> On Fri, Feb 22, 2019 at 4:34 PM Chesnay Schepler >>>>>>>>>>>>>>>>> <ches...@apache.org <mailto:ches...@apache.org>> wrote: >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> Re: Flip >>>>>>>>>>>>>>>>>>> The very first line under both the main header and >> Purpose >>>>>>>>>>>>>>>>> section >>>>>>>>>>>>>>>>>>> describe Flips as "major changes", which this isn't. >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> Re: complication >>>>>>>>>>>>>>>>>>> I'm not arguing against standardization, but again an >>>>>>>>>>>>>>>>> over-complicated >>>>>>>>>>>>>>>>>>> implementation when a static utility method would be >>>>>>>>>>>>>>>>> sufficient. >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> public static void setupConnectorMetrics( >>>>>>>>>>>>>>>>>>> MetricGroup operatorMetricGroup, >>>>>>>>>>>>>>>>>>> String connectorName, >>>>>>>>>>>>>>>>>>> Optional<Gauge<Long>> numRecordsIn, >>>>>>>>>>>>>>>>>>> ...) >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> This gives you all you need: >>>>>>>>>>>>>>>>>>> * a well-defined set of metrics for a connector to opt-in >>>>>>>>>>>>>>>>>>> * standardized naming schemes for scope and individual >>>>>>>>>>>>>>> metrics >>>>>>>>>>>>>>>>>>> * standardize metric types (although personally I'm not >>>>>>>>>>>>>>>>> interested in that >>>>>>>>>>>>>>>>>>> since metric types should be considered syntactic sugar) >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> Re: Configurable Histogram >>>>>>>>>>>>>>>>>>> If anything they _must_ be turned off by default, but the >>>>>>>>>>>>>>>>> metric system is >>>>>>>>>>>>>>>>>>> already exposing so many options that I'm not too keen on >>>>>>>>>>>>>>>>> adding even more. >>>>>>>>>>>>>>>>>>> You have also only addressed my first argument against >>>>>>>>>>>>>>>>> histograms >>>>>>>>>>>>>>>>>>> (performance), the second one still stands (calculate >>>>>>>>>>>>>>>>> histogram in metric >>>>>>>>>>>>>>>>>>> backends instead). >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> On 21.02.2019 16:27, Becket Qin wrote: >>>>>>>>>>>>>>>>>>>> Hi Chesnay, >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> Thanks for the comments. I think this is worthy of a >> FLIP >>>>>>>>>>>>>>>>> because it is >>>>>>>>>>>>>>>>>>>> public API. According to the FLIP description a FlIP is >>>>>>>>>>>>>>>>> required in case >>>>>>>>>>>>>>>>>>> of: >>>>>>>>>>>>>>>>>>>> - Any change that impacts the public interfaces of >>>>>>>>>>>>>>>>> the project >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> and the following entry is found in the definition of >>>>>>>>>>>>>>>>> "public interface". >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> - Exposed monitoring information >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> Metrics are critical to any production system. So a >> clear >>>>>>>>>>>>>>>>> metric >>>>>>>>>>>>>>>>>>> definition >>>>>>>>>>>>>>>>>>>> is important for any serious users. For an organization >>>>>>>>>>>>>>>>> with large Flink >>>>>>>>>>>>>>>>>>>> installation, change in metrics means great amount of >>>>>>>>>>>>>>>>> work. So such >>>>>>>>>>>>>>>>>>> changes >>>>>>>>>>>>>>>>>>>> do need to be fully discussed and documented. >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> ** Re: Histogram. >>>>>>>>>>>>>>>>>>>> We can discuss whether there is a better way to expose >>>>>>>>>>>>>>>>> metrics that are >>>>>>>>>>>>>>>>>>>> suitable for histograms. 
My micro-benchmark on various >>>>>>>>>>>>>>>>> histogram >>>>>>>>>>>>>>>>>>>> implementations also indicates that they are >>>>>>>>>>>>>> significantly >>>>>>>>>>>>>>>>> slower than >>>>>>>>>>>>>>>>>>>> other metric types. But I don't think that means never >>>>>>>>>>>>>> use >>>>>>>>>>>>>>>>> histogram, but >>>>>>>>>>>>>>>>>>>> means use it with caution. For example, we can suggest >>>>>>>>>>>>>> the >>>>>>>>>>>>>>>>>>> implementations >>>>>>>>>>>>>>>>>>>> to turn them off by default and only turn it on for a >>>>>>>>>>>>>>>>> small amount of >>>>>>>>>>>>>>>>>>> time >>>>>>>>>>>>>>>>>>>> when performing some micro-debugging. >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> ** Re: complication: >>>>>>>>>>>>>>>>>>>> Connector conventions are essential for Flink ecosystem. >>>>>>>>>>>>>>>>> Flink connectors >>>>>>>>>>>>>>>>>>>> pool is probably the most important part of Flink, just >>>>>>>>>>>>>>>>> like any other >>>>>>>>>>>>>>>>>>> data >>>>>>>>>>>>>>>>>>>> system. Clear conventions of connectors will help build >>>>>>>>>>>>>>>>> Flink ecosystem >>>>>>>>>>>>>>>>>>> in >>>>>>>>>>>>>>>>>>>> a more organic way. >>>>>>>>>>>>>>>>>>>> Take the metrics convention as an example, imagine >>>>>>>>>>>>>> someone >>>>>>>>>>>>>>>>> has developed >>>>>>>>>>>>>>>>>>> a >>>>>>>>>>>>>>>>>>>> Flink connector for System foo, and another developer >> may >>>>>>>>>>>>>>>>> have developed >>>>>>>>>>>>>>>>>>> a >>>>>>>>>>>>>>>>>>>> monitoring and diagnostic framework for Flink which >>>>>>>>>>>>>>>>> analyzes the Flink >>>>>>>>>>>>>>>>>>> job >>>>>>>>>>>>>>>>>>>> performance based on metrics. With a clear metric >>>>>>>>>>>>>>>>> convention, those two >>>>>>>>>>>>>>>>>>>> projects could be developed independently. Once users >> put >>>>>>>>>>>>>>>>> them together, >>>>>>>>>>>>>>>>>>>> it would work without additional modifications. This >>>>>>>>>>>>>>>>> cannot be easily >>>>>>>>>>>>>>>>>>>> achieved by just defining a few constants. >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> ** Re: selective metrics: >>>>>>>>>>>>>>>>>>>> Sure, we can discuss that in a separate thread. >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> @Dawid >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> ** Re: latency / fetchedLatency >>>>>>>>>>>>>>>>>>>> The primary purpose of establish such a convention is to >>>>>>>>>>>>>>>>> help developers >>>>>>>>>>>>>>>>>>>> write connectors in a more compatible way. The >> convention >>>>>>>>>>>>>>>>> is supposed to >>>>>>>>>>>>>>>>>>> be >>>>>>>>>>>>>>>>>>>> defined more proactively. So when look at the >> convention, >>>>>>>>>>>>>>>>> it seems more >>>>>>>>>>>>>>>>>>>> important to see if the concept is applicable to >>>>>>>>>>>>>>>>> connectors in general. >>>>>>>>>>>>>>>>>>> It >>>>>>>>>>>>>>>>>>>> might be true so far only Kafka connector reports >>>>>>>>>>>>>> latency. >>>>>>>>>>>>>>>>> But there >>>>>>>>>>>>>>>>>>> might >>>>>>>>>>>>>>>>>>>> be hundreds of other connector implementations in the >>>>>>>>>>>>>>>>> Flink ecosystem, >>>>>>>>>>>>>>>>>>>> though not in the Flink repo, and some of them also >> emits >>>>>>>>>>>>>>>>> latency. I >>>>>>>>>>>>>>>>>>> think >>>>>>>>>>>>>>>>>>>> a lot of other sources actually also has an append >>>>>>>>>>>>>>>>> timestamp. e.g. >>>>>>>>>>>>>>>>>>> database >>>>>>>>>>>>>>>>>>>> bin logs and some K-V stores. So I wouldn't be surprised >>>>>>>>>>>>>>>>> if some database >>>>>>>>>>>>>>>>>>>> connector can also emit latency metrics. 
>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> Thanks, >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> Jiangjie (Becket) Qin >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> On Thu, Feb 21, 2019 at 10:14 PM Chesnay Schepler >>>>>>>>>>>>>>>>> <ches...@apache.org <mailto:ches...@apache.org>> >>>>>>>>>>>>>>>>>>>> wrote: >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> Regarding 2) It doesn't make sense to investigate this >>>>>>>>>>>>>> as >>>>>>>>>>>>>>>>> part of this >>>>>>>>>>>>>>>>>>>>> FLIP. This is something that could be of interest for >>>>>>>>>>>>>> the >>>>>>>>>>>>>>>>> entire metric >>>>>>>>>>>>>>>>>>>>> system, and should be designed for as such. >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> Regarding the proposal as a whole: >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> Histogram metrics shall not be added to the core of >>>>>>>>>>>>>>>>> Flink. They are >>>>>>>>>>>>>>>>>>>>> significantly more expensive than other metrics, and >>>>>>>>>>>>>>>>> calculating >>>>>>>>>>>>>>>>>>>>> histograms in the application is regarded as an >>>>>>>>>>>>>>>>> anti-pattern by several >>>>>>>>>>>>>>>>>>>>> metric backends, who instead recommend to expose the >> raw >>>>>>>>>>>>>>>>> data and >>>>>>>>>>>>>>>>>>>>> calculate the histogram in the backend. >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> Second, this seems overly complicated. Given that we >>>>>>>>>>>>>>>>> already established >>>>>>>>>>>>>>>>>>>>> that not all connectors will export all metrics we are >>>>>>>>>>>>>>>>> effectively >>>>>>>>>>>>>>>>>>>>> reducing this down to a consistent naming scheme. We >>>>>>>>>>>>>>>>> don't need anything >>>>>>>>>>>>>>>>>>>>> sophisticated for that; basically just a few constants >>>>>>>>>>>>>>>>> that all >>>>>>>>>>>>>>>>>>>>> connectors use. >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> I'm not convinced that this is worthy of a FLIP. >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> On 21.02.2019 14:26, Dawid Wysakowicz wrote: >>>>>>>>>>>>>>>>>>>>>> Hi, >>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>> Ad 1. In general I undestand and I agree. But those >>>>>>>>>>>>>>>>> particular metrics >>>>>>>>>>>>>>>>>>>>>> (latency, fetchLatency), right now would only be >>>>>>>>>>>>>>>>> reported if user uses >>>>>>>>>>>>>>>>>>>>>> KafkaConsumer with internal timestampAssigner with >>>>>>>>>>>>>>>>> StreamCharacteristic >>>>>>>>>>>>>>>>>>>>>> set to EventTime, right? That sounds like a very >>>>>>>>>>>>>>>>> specific case. I am >>>>>>>>>>>>>>>>>>> not >>>>>>>>>>>>>>>>>>>>>> sure if we should introduce a generic metric that will >>>>>>>>>>>>>> be >>>>>>>>>>>>>>>>>>>>>> disabled/absent for most of implementations. >>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>> Ad.2 That sounds like an orthogonal issue, that might >>>>>>>>>>>>>>>>> make sense to >>>>>>>>>>>>>>>>>>>>>> investigate in the future. >>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>> Best, >>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>> Dawid >>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>> On 21/02/2019 13:20, Becket Qin wrote: >>>>>>>>>>>>>>>>>>>>>>> Hi Dawid, >>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>> Thanks for the feedback. That makes sense to me. >> There >>>>>>>>>>>>>>>>> are two cases >>>>>>>>>>>>>>>>>>> to >>>>>>>>>>>>>>>>>>>>> be >>>>>>>>>>>>>>>>>>>>>>> addressed. >>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>> 1. The metrics are supposed to be a guidance. It is >>>>>>>>>>>>>>>>> likely that a >>>>>>>>>>>>>>>>>>>>> connector >>>>>>>>>>>>>>>>>>>>>>> only supports some but not all of the metrics. 
In >> that >>>>>>>>>>>>>>>>> case, each >>>>>>>>>>>>>>>>>>>>> connector >>>>>>>>>>>>>>>>>>>>>>> implementation should have the freedom to decide >> which >>>>>>>>>>>>>>>>> metrics are >>>>>>>>>>>>>>>>>>>>>>> reported. For the metrics that are supported, the >>>>>>>>>>>>>>>>> guidance should be >>>>>>>>>>>>>>>>>>>>>>> followed. >>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>> 2. Sometimes users may want to disable certain >> metrics >>>>>>>>>>>>>>>>> for some reason >>>>>>>>>>>>>>>>>>>>>>> (e.g. performance / reprocessing of data). A generic >>>>>>>>>>>>>>>>> mechanism should >>>>>>>>>>>>>>>>>>> be >>>>>>>>>>>>>>>>>>>>>>> provided to allow user choose which metrics are >>>>>>>>>>>>>>>>> reported. This >>>>>>>>>>>>>>>>>>> mechanism >>>>>>>>>>>>>>>>>>>>>>> should also be honored by the connector >>>>>>>>>>>>>> implementations. >>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>> Does this sound reasonable to you? >>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>> Thanks, >>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>> Jiangjie (Becket) Qin >>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>> On Thu, Feb 21, 2019 at 4:22 PM Dawid Wysakowicz < >>>>>>>>>>>>>>>>>>>>> dwysakow...@apache.org <mailto:dwysakow...@apache.org >>>> >>>>>>>>>>>>>>>>>>>>>>> wrote: >>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>> Hi, >>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>> Generally I like the idea of having a unified, >>>>>>>>>>>>>>>>> standard set of >>>>>>>>>>>>>>>>>>> metrics >>>>>>>>>>>>>>>>>>>>> for >>>>>>>>>>>>>>>>>>>>>>>> all connectors. I have some slight concerns about >>>>>>>>>>>>>>>>> fetchLatency and >>>>>>>>>>>>>>>>>>>>>>>> latency though. They are computed based on EventTime >>>>>>>>>>>>>>>>> which is not a >>>>>>>>>>>>>>>>>>>>> purely >>>>>>>>>>>>>>>>>>>>>>>> technical feature. It depends often on some business >>>>>>>>>>>>>>>>> logic, might be >>>>>>>>>>>>>>>>>>>>> absent >>>>>>>>>>>>>>>>>>>>>>>> or defined after source. Those metrics could also >>>>>>>>>>>>>>>>> behave in a weird >>>>>>>>>>>>>>>>>>>>> way in >>>>>>>>>>>>>>>>>>>>>>>> case of replaying backlog. Therefore I am not sure >> if >>>>>>>>>>>>>>>>> we should >>>>>>>>>>>>>>>>>>> include >>>>>>>>>>>>>>>>>>>>>>>> those metrics by default. Maybe we could at least >>>>>>>>>>>>>>>>> introduce a feature >>>>>>>>>>>>>>>>>>>>>>>> switch for them? What do you think? >>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>> Best, >>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>> Dawid >>>>>>>>>>>>>>>>>>>>>>>> On 21/02/2019 03:13, Becket Qin wrote: >>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>> Bump. If there is no objections to the proposed >>>>>>>>>>>>>>>>> metrics. I'll start a >>>>>>>>>>>>>>>>>>>>>>>> voting thread later toady. >>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>> Thanks, >>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>> Jiangjie (Becket) Qin >>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>> On Mon, Feb 11, 2019 at 8:17 PM Becket Qin >>>>>>>>>>>>>>>>> <becket....@gmail.com <mailto:becket....@gmail.com>> < >>>>>>>>>>>>>>>>>>>>> becket....@gmail.com <mailto:becket....@gmail.com>> >>>>>>>>>>>>>>> wrote: >>>>>>>>>>>>>>>>>>>>>>>> Hi folks, >>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>> I would like to start the FLIP discussion thread >>>>>>>>>>>>>> about >>>>>>>>>>>>>>>>> standardize >>>>>>>>>>>>>>>>>>> the >>>>>>>>>>>>>>>>>>>>>>>> connector metrics. 
>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>> In short, we would like to provide a convention of >>>>>>>>>>>>>>>>> Flink connector >>>>>>>>>>>>>>>>>>>>>>>> metrics. It will help simplify the monitoring and >>>>>>>>>>>>>>>>> alerting on Flink >>>>>>>>>>>>>>>>>>>>> jobs. >>>>>>>>>>>>>>>>>>>>>>>> The FLIP link is following: >>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>> >>>>>>> >>>>> >> https://cwiki.apache.org/confluence/display/FLINK/FLIP-33%3A+Standardize+Connector+Metrics >>>>>>>>>>>>>>>>>>>>>>>> Thanks, >>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>> Jiangjie (Becket) Qin >>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>> >>>>> >> >>