Re: [DISCUSS] FLIP-33: Standardize connector metrics

Piotr Nowojski Thu, 13 Jun 2019 00:51:54 -0700

Thanks for driving this. I’ve just noticed one small thing. With new 
SourceReader interface Flink will be able to provide `idleTime` metric 
automatically.


Piotrek

> On 13 Jun 2019, at 03:30, Becket Qin <becket....@gmail.com> wrote:
> 
> Thanks all for the feedback and discussion.
> 
> Since there wasn't any concern raised, I've started the voting thread for
> this FLIP, but please feel free to continue the discussion here if you
> think something still needs to be addressed.
> 
> Thanks,
> 
> Jiangjie (Becket) Qin
> 
> 
> 
> On Mon, Jun 10, 2019 at 9:10 AM Becket Qin <becket....@gmail.com> wrote:
> 
>> Hi Piotr,
>> 
>> Thanks for the comments. Yes, you are right. Users will have to look at
>> other metrics to decide whether the pipeline is healthy or not in the first
>> place before they can use the time-based metric to fix the bottleneck.
>> 
>> I agree that once we have FLIP-27 ready, some of the metrics can just be
>> reported by the abstract implementation.
>> 
>> I've updated FLIP-33 wiki page to add the pendingBytes and pendingRecords
>> metric. Please let me know if you have any concern over the updated metric
>> convention proposal.
>> 
>> @Chesnay Schepler <ches...@apache.org> @Stephan Ewen
>> <step...@ververica.com> will you also have time to take a look at the
>> proposed metric convention? If there is no further concern I'll start a
>> voting thread for this FLIP in two days.
>> 
>> Thanks,
>> 
>> Jiangjie (Becket) Qin
>> 
>> 
>> 
>> On Wed, Jun 5, 2019 at 6:54 PM Piotr Nowojski <pi...@ververica.com> wrote:
>> 
>>> Hi Becket,
>>> 
>>> Thanks for the answer :)
>>> 
>>>> By time-based metric, I meant the portion of time spent on producing the
>>>> record to downstream. For example, a source connector can report that
>>> it's
>>>> spending 80% of time to emit record to downstream processing pipeline.
>>> In
>>>> another case, a sink connector may report that its spending 30% of time
>>>> producing the records to the external system.
>>>> 
>>>> This is in some sense equivalent to the buffer usage metric:
>>> 
>>>>  - 80% of time spent on emitting records to downstream ---> downstream
>>>> node is bottleneck ---> output buffer is probably full.
>>>>  - 30% of time spent on emitting records to downstream ---> downstream
>>>> node is not bottleneck ---> output buffer is probably not full.
>>> 
>>> If by “time spent on emitting records to downstream” you understand
>>> “waiting on back pressure”, then I see your point. And I agree that some
>>> kind of ratio/time based metric gives you more information. However under
>>> “time spent on emitting records to downstream” might be hidden the
>>> following (extreme) situation:
>>> 
>>> 1. Job is barely able to handle influx of records, there is 99%
>>> CPU/resource usage in the cluster, but nobody is
>>> bottlenecked/backpressured, all output buffers are empty, everybody is
>>> waiting in 1% of it’s time for more records to process.
>>> 2. 80% time can still be spent on "down stream operators”, because they
>>> are the CPU intensive operations, but this doesn’t mean that increasing the
>>> parallelism down the stream will help with anything there. To the contrary,
>>> increasing parallelism of the source operator might help to increase
>>> resource utilisation up to 100%.
>>> 
>>> However, this “time based/ratio” approach can be extended to in/output
>>> buffer usage. Besides collecting an information that input/output buffer is
>>> full/empty, we can probe profile how often are buffer empty/full. If output
>>> buffer is full 1% of times, there is almost no back pressure. If it’s full
>>> 80% of times, there is some back pressure, if it’s full 99.9% of times,
>>> there is huge back pressure.
>>> 
>>> Now for autoscaling you could compare the input & output buffers fill
>>> ratio:
>>> 
>>> 1. Both are high, the source of bottleneck is down the stream
>>> 2. Output is low, input is high, this is the bottleneck and the higher
>>> the difference, the bigger source of bottleneck is this is operator/task
>>> 3. Output is high, input is low - there was some load spike that we are
>>> currently finishing to process
>>> 
>>> 
>>> 
>>> But long story short, we are probably diverging from the topic of this
>>> discussion, and we can discuss this at some later point.
>>> 
>>> For now, for sources:
>>> 
>>> as I wrote before, +1 for:
>>> - pending.bytes, Gauge
>>> - pending.messages, Gauge
>>> 
>>> When we will be developing/discussing SourceReader from FLIP-27 we might
>>> then add:
>>> 
>>> - in-memory.buffer.usage (0 - 100%)
>>> 
>>> Which will be estimated automatically by Flink while user will be able to
>>> override/provide better estimation.
>>> 
>>> Piotrek
>>> 
>>>> On 5 Jun 2019, at 05:42, Becket Qin <becket....@gmail.com> wrote:
>>>> 
>>>> Hi Piotr,
>>>> 
>>>> Thanks for the explanation. Please see some clarifications below.
>>>> 
>>>> By time-based metric, I meant the portion of time spent on producing the
>>>> record to downstream. For example, a source connector can report that
>>> it's
>>>> spending 80% of time to emit record to downstream processing pipeline.
>>> In
>>>> another case, a sink connector may report that its spending 30% of time
>>>> producing the records to the external system.
>>>> 
>>>> This is in some sense equivalent to the buffer usage metric:
>>>>  - 80% of time spent on emitting records to downstream ---> downstream
>>>> node is bottleneck ---> output buffer is probably full.
>>>>  - 30% of time spent on emitting records to downstream ---> downstream
>>>> node is not bottleneck ---> output buffer is probably not full.
>>>> 
>>>> However, the time-based metric has a few advantages that the buffer
>>> usage
>>>> metric may not have.
>>>> 
>>>> 1.  Buffer usage metric may not be applicable to all the connector
>>>> implementations, while reporting time-based metric are always doable.
>>>> Some source connectors may not have any input buffer, or they may use
>>> some
>>>> third party library that does not expose the input buffer at all.
>>>> Similarly, for sink connectors, the implementation may not have any
>>> output
>>>> buffer, or the third party library does not expose such buffer.
>>>> 
>>>> 2. Although both type of metrics can detect bottleneck, time-based
>>> metrics
>>>> can be used to generate a more informed action to remove the bottleneck.
>>>> For example, when the downstream is bottleneck, the output buffer usage
>>>> metric is likely to be 100%, and the input buffer usage might be 0%.
>>> That
>>>> means we don't know what is the suitable parallelism to lift the
>>>> bottleneck. The time-based metric, on the other hand, would give useful
>>>> information, e.g. if 80% of time was spent on emitting records, we can
>>>> roughly increase the downstream node parallelism by 4 times.
>>>> 
>>>> Admittedly, the time-based metrics are more expensive than buffer
>>> usage. So
>>>> we may have to do some sampling to reduce the cost. But in general,
>>> using
>>>> time-based metrics seems worth adding.
>>>> 
>>>> That being said, I don't think buffer usage metric and time-based
>>> metrics
>>>> are mutually exclusive. We can probably have both. It is just that in
>>>> practice, features like auto-scaling might prefer time-based metrics for
>>>> the reason stated above.
>>>> 
>>>>> 1. Define the metrics that would allow us to manually detect
>>> bottlenecks.
>>>> As I wrote, we already have them in most of the places, except of
>>>> sources/sinks.
>>>>> 2. Use those metrics, to automatically detect bottlenecks. Currently we
>>>> are only automatically detecting back pressure and reporting it to the
>>> user
>>>> in web UI (is it exposed as a metric at all?). Detecting the root cause
>>> of
>>>> the back pressure (bottleneck) is one step further.
>>>>> 3. Use the knowledge about where exactly the bottleneck is located, to
>>>> try to do something with it.
>>>> 
>>>> As explained above, I think time-based metric also addresses item 1 and
>>>> item 2.
>>>> 
>>>> Any thoughts？
>>>> 
>>>> Thanks,
>>>> 
>>>> Jiangjie (Becket) Qin
>>>> 
>>>> 
>>>> 
>>>> On Mon, Jun 3, 2019 at 4:14 PM Piotr Nowojski <pi...@ververica.com>
>>> wrote:
>>>> 
>>>>> Hi again :)
>>>>> 
>>>>>> - pending.bytes, Gauge
>>>>>> - pending.messages, Gauge
>>>>> 
>>>>> 
>>>>> +1
>>>>> 
>>>>> And true, instead of overloading one of the metric it is better when
>>> user
>>>>> can choose to provide only one of them.
>>>>> 
>>>>> Re 2:
>>>>> 
>>>>>> If I understand correctly, this metric along with the pending mesages
>>> /
>>>>>> bytes would answer the questions of:
>>>>> 
>>>>>> - Does the connector consume fast enough? Lagging behind + empty
>>> buffer
>>>>> =
>>>>>> cannot consume fast enough.
>>>>>> - Does the connector emit fast enough? Lagging behind + full buffer =
>>>>>> cannot emit fast enough, i.e. the Flink pipeline is slow.
>>>>> 
>>>>> Yes, exactly. This can also be used to support decisions like changing
>>> the
>>>>> parallelism of the sources and/or down stream operators.
>>>>> 
>>>>> I’m not sure if I understand your proposal with time based
>>> measurements.
>>>>> Maybe I’m missing something, but I do not see how measuring time alone
>>>>> could answer the problem: where is the bottleneck. Time spent on the
>>>>> next/emit might be short or long (depending on how heavy to process the
>>>>> record is) and the source can still be bottlenecked/back pressured or
>>> not.
>>>>> Usually the easiest and the most reliable way how to detect
>>> bottlenecks is
>>>>> by checking usage of input & output buffers, since when input buffer is
>>>>> full while output buffer is empty, that’s the definition of a
>>> bottleneck.
>>>>> Also this is usually very easy and cheap to measure (it works
>>> effectively
>>>>> the same way as current’s Flink back pressure monitoring, but more
>>> cleanly,
>>>>> without probing thread’s stack traces).
>>>>> 
>>>>> Also keep in mind that we are already using the buffer usage metrics
>>> for
>>>>> detecting the bottlenecks in Flink’s internal network exchanges (manual
>>>>> work). That’s the reason why I wanted to extend this to sources/sinks,
>>>>> since they are currently our blind spot.
>>>>> 
>>>>>> One feature we are currently working on to scale Flink automatically
>>>>> relies
>>>>>> on some metrics answering the same question
>>>>> 
>>>>> That would be very helpful feature. I think in order to achieve that we
>>>>> would need to:
>>>>> 1. Define the metrics that would allow us to manually detect
>>> bottlenecks.
>>>>> As I wrote, we already have them in most of the places, except of
>>>>> sources/sinks.
>>>>> 2. Use those metrics, to automatically detect bottlenecks. Currently we
>>>>> are only automatically detecting back pressure and reporting it to the
>>> user
>>>>> in web UI (is it exposed as a metric at all?). Detecting the root
>>> cause of
>>>>> the back pressure (bottleneck) is one step further.
>>>>> 3. Use the knowledge about where exactly the bottleneck is located, to
>>> try
>>>>> to do something with it.
>>>>> 
>>>>> I think you are aiming for point 3., but before we reach it, we are
>>> still
>>>>> missing 1. & 2. Also even if we have 3., there is a value in 1 & 2 for
>>>>> manual analysis/dashboards.
>>>>> 
>>>>> However, having the knowledge of where the bottleneck is, doesn’t
>>>>> necessarily mean that we know what we can do about it. For example
>>>>> increasing parallelism might or might not help with anything (data
>>> skew,
>>>>> bottleneck on some resource that does not scale), but this remark
>>> applies
>>>>> always, regardless of the way how did we detect the bottleneck.
>>>>> 
>>>>> Piotrek
>>>>> 
>>>>>> On 3 Jun 2019, at 06:16, Becket Qin <becket....@gmail.com> wrote:
>>>>>> 
>>>>>> Hi Piotr,
>>>>>> 
>>>>>> Thanks for the suggestion. Some thoughts below:
>>>>>> 
>>>>>> Re 1: The pending messages / bytes.
>>>>>> I completely agree these are very useful metrics and we should expect
>>> the
>>>>>> connector to report. WRT the way to expose them, it seems more
>>> consistent
>>>>>> to add two metrics instead of adding a method (unless there are other
>>> use
>>>>>> cases other than metric reporting). So we can add the following two
>>>>> metrics.
>>>>>> - pending.bytes, Gauge
>>>>>> - pending.messages, Gauge
>>>>>> Applicable connectors can choose to report them. These two metrics
>>> along
>>>>>> with latency should be sufficient for users to understand the progress
>>>>> of a
>>>>>> connector.
>>>>>> 
>>>>>> 
>>>>>> Re 2: Number of buffered data in-memory of the connector
>>>>>> If I understand correctly, this metric along with the pending mesages
>>> /
>>>>>> bytes would answer the questions of:
>>>>>> - Does the connector consume fast enough? Lagging behind + empty
>>> buffer
>>>>> =
>>>>>> cannot consume fast enough.
>>>>>> - Does the connector emit fast enough? Lagging behind + full buffer =
>>>>>> cannot emit fast enough, i.e. the Flink pipeline is slow.
>>>>>> 
>>>>>> One feature we are currently working on to scale Flink automatically
>>>>> relies
>>>>>> on some metrics answering the same question, more specifically, we are
>>>>>> profiling the time spent on .next() method (time to consume) and the
>>> time
>>>>>> spent on .collect() method (time to emit / process). One advantage of
>>>>> such
>>>>>> method level time cost allows us to calculate the parallelism
>>> required to
>>>>>> keep up in case their is a lag.
>>>>>> 
>>>>>> However, one concern I have regarding such metric is that they are
>>>>>> implementation specific. Either profiling on the method time, or
>>>>> reporting
>>>>>> buffer usage assumes the connector are implemented in such a way. A
>>>>>> slightly better solution might be have the following metric:
>>>>>> 
>>>>>>   - EmitTimeRatio (or FetchTimeRatio): The time spent on emitting
>>>>>> records / Total time elapsed.
>>>>>> 
>>>>>> This assumes that the source connectors have to emit the records to
>>> the
>>>>>> downstream at some point. The emission may take some time ( e.g. go
>>>>> through
>>>>>> chained operators). And the rest of the time are spent to prepare the
>>>>>> record to emit, including time for consuming and format conversion,
>>> etc.
>>>>>> Ideally, we'd like to see the time spent on record fetch and emit to
>>> be
>>>>>> about the same, so no one is bottleneck for the other.
>>>>>> 
>>>>>> The downside of these time based metrics is additional overhead to get
>>>>> the
>>>>>> time, therefore sampling might be needed. But in practice I feel such
>>>>> time
>>>>>> based metric might be more useful if we want to take action.
>>>>>> 
>>>>>> 
>>>>>> I think we should absolutely add metrics in (1) to the metric
>>> convention.
>>>>>> We could also add the metrics mentioned in (2) if we reach consensus
>>> on
>>>>>> that. What do you think?
>>>>>> 
>>>>>> Thanks,
>>>>>> 
>>>>>> Jiangjie (Becket) Qin
>>>>>> 
>>>>>> 
>>>>>> On Fri, May 31, 2019 at 4:26 PM Piotr Nowojski <pi...@ververica.com>
>>>>> wrote:
>>>>>> 
>>>>>>> Hey Becket,
>>>>>>> 
>>>>>>> Re 1a) and 1b) +1 from my side.
>>>>>>> 
>>>>>>> I’ve discussed this issue:
>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 2. It would be nice to have metrics, that allow us to check the
>>> cause
>>>>>>> of
>>>>>>>>>> back pressure:
>>>>>>>>>> a) for sources, length of input queue (in bytes? Or boolean
>>>>>>>>>> hasSomethingl/isEmpty)
>>>>>>>>>> b) for sinks, length of output queue (in bytes? Or boolean
>>>>>>>>>> hasSomething/isEmpty)
>>>>>>> 
>>>>>>> With Nico at some lengths and he also saw the benefits of them. We
>>> also
>>>>>>> have more concrete proposal for that.
>>>>>>> 
>>>>>>> Actually there are two really useful metrics, that we are missing
>>>>>>> currently:
>>>>>>> 
>>>>>>> 1. Number of data/records/bytes in the backlog to process. For
>>> example
>>>>>>> remaining number of bytes in unread files. Or pending data in Kafka
>>>>> topics.
>>>>>>> 2. Number of buffered data in-memory of the connector, that are
>>> waiting
>>>>> to
>>>>>>> be processed pushed to Flink pipeline.
>>>>>>> 
>>>>>>> Re 1:
>>>>>>> This would have to be a metric provided directly by a connector. It
>>>>> could
>>>>>>> be an undefined `int`:
>>>>>>> 
>>>>>>> `int backlog` - estimate of pending work.
>>>>>>> 
>>>>>>> “Undefined” meaning that it would be up to a connector to decided
>>>>> whether
>>>>>>> it’s measured in bytes, records, pending files or whatever it is
>>>>> possible
>>>>>>> to provide by the connector. This is because I assume not every
>>>>> connector
>>>>>>> can provide exact number and for some of them it might be impossible
>>> to
>>>>>>> provide records number of bytes count.
>>>>>>> 
>>>>>>> Re 2:
>>>>>>> This metric could be either provided by a connector, or calculated
>>>>> crudely
>>>>>>> by Flink:
>>>>>>> 
>>>>>>> `float bufferUsage` - value from [0.0, 1.0] range
>>>>>>> 
>>>>>>> Percentage of used in memory buffers, like in Kafka’s handover.
>>>>>>> 
>>>>>>> It could be crudely implemented by Flink with FLIP-27
>>>>>>> SourceReader#isAvailable. If SourceReader is not available reported
>>>>>>> `bufferUsage` could be 0.0. If it is available, it could be 1.0. I
>>> think
>>>>>>> this would be a good enough estimation for most of the use cases
>>> (that
>>>>>>> could be overloaded and implemented better if desired). Especially
>>>>> since we
>>>>>>> are reporting only probed values: if probed values are almost always
>>>>> “1.0”,
>>>>>>> it would mean that we have a back pressure. If they are almost always
>>>>>>> “0.0”, there is probably no back pressure at the sources.
>>>>>>> 
>>>>>>> What do you think about this?
>>>>>>> 
>>>>>>> Piotrek
>>>>>>> 
>>>>>>>> On 30 May 2019, at 11:41, Becket Qin <becket....@gmail.com> wrote:
>>>>>>>> 
>>>>>>>> Hi all,
>>>>>>>> 
>>>>>>>> Thanks a lot for all the feedback and comments. I'd like to continue
>>>>> the
>>>>>>>> discussion on this FLIP.
>>>>>>>> 
>>>>>>>> I updated the FLIP-33 wiki to remove all the histogram metrics from
>>> the
>>>>>>>> first version of metric convention due to the performance concern.
>>> The
>>>>>>> plan
>>>>>>>> is to introduce them later when we have a mechanism to opt in/out
>>>>>>> metrics.
>>>>>>>> At that point, users can decide whether they want to pay the cost to
>>>>> get
>>>>>>>> the metric or not.
>>>>>>>> 
>>>>>>>> As Stephan suggested, for this FLIP, let's first try to agree on the
>>>>>>> small
>>>>>>>> list of conventional metrics that connectors should follow.
>>>>>>>> Just to be clear, the purpose of the convention is not to enforce
>>> every
>>>>>>>> connector to report all these metrics, but to provide a guidance in
>>>>> case
>>>>>>>> these metrics are reported by some connectors.
>>>>>>>> 
>>>>>>>> 
>>>>>>>> @ Stephan & Chesnay,
>>>>>>>> 
>>>>>>>> Regarding the duplication of `RecordsIn` metric in operator / task
>>>>>>>> IOMetricGroups, from what I understand, for source operator, it is
>>>>>>> actually
>>>>>>>> the SourceFunction that reports the operator level
>>>>>>>> RecordsIn/RecordsInPerSecond metric. So they are essentially the
>>> same
>>>>>>>> metric in the operator level IOMetricGroup. Similarly for the Sink
>>>>>>>> operator, the operator level RecordsOut/RecordsOutPerSecond metrics
>>> are
>>>>>>>> also reported by the Sink function. I marked them as existing in the
>>>>>>>> FLIP-33 wiki page. Please let me know if I misunderstood.
>>>>>>>> 
>>>>>>>> Thanks,
>>>>>>>> 
>>>>>>>> Jiangjie (Becket) Qin
>>>>>>>> 
>>>>>>>> 
>>>>>>>> On Thu, May 30, 2019 at 5:16 PM Becket Qin <becket....@gmail.com>
>>>>> wrote:
>>>>>>>> 
>>>>>>>>> Hi Piotr,
>>>>>>>>> 
>>>>>>>>> Thanks a lot for the feedback.
>>>>>>>>> 
>>>>>>>>> 1a) I guess you are referring to the part that "original system
>>>>> specific
>>>>>>>>> metrics should also be reported". The performance impact depends on
>>>>> the
>>>>>>>>> implementation. An efficient implementation would only record the
>>>>> metric
>>>>>>>>> once, but report them with two different metric names. This is
>>>>> unlikely
>>>>>>> to
>>>>>>>>> hurt performance.
>>>>>>>>> 
>>>>>>>>> 1b) Yes, I agree that we should avoid adding overhead to the
>>> critical
>>>>>>> path
>>>>>>>>> by all means. This is sometimes a tradeoff, running blindly without
>>>>> any
>>>>>>>>> metric gives best performance, but sometimes might be frustrating
>>> when
>>>>>>> we
>>>>>>>>> debug some issues.
>>>>>>>>> 
>>>>>>>>> 2. The metrics are indeed very useful. Are they supposed to be
>>>>> reported
>>>>>>> by
>>>>>>>>> the connectors or Flink itself? At this point FLIP-33 is more
>>> focused
>>>>> on
>>>>>>>>> provide a guidance to the connector authors on the metrics
>>> reporting.
>>>>>>> That
>>>>>>>>> said, after FLIP-27, I think we should absolutely report these
>>> metrics
>>>>>>> in
>>>>>>>>> the abstract implementation. In any case, the metric convention in
>>>>> this
>>>>>>>>> list are expected to evolve over time.
>>>>>>>>> 
>>>>>>>>> Thanks,
>>>>>>>>> 
>>>>>>>>> Jiangjie (Becket) Qin
>>>>>>>>> 
>>>>>>>>> On Tue, May 28, 2019 at 6:24 PM Piotr Nowojski <
>>> pi...@ververica.com>
>>>>>>>>> wrote:
>>>>>>>>> 
>>>>>>>>>> Hi,
>>>>>>>>>> 
>>>>>>>>>> Thanks for the proposal and driving the effort here Becket :) I’ve
>>>>> read
>>>>>>>>>> through the FLIP-33 [1], and here are couple of my thoughts.
>>>>>>>>>> 
>>>>>>>>>> Big +1 for standardising the metric names between connectors, it
>>> will
>>>>>>>>>> definitely help us and users a lot.
>>>>>>>>>> 
>>>>>>>>>> Issues/questions/things to discuss that I’ve thought of:
>>>>>>>>>> 
>>>>>>>>>> 1a. If we are about to duplicate some metrics, can this become a
>>>>>>>>>> performance issue?
>>>>>>>>>> 1b. Generally speaking, we should make sure that collecting those
>>>>>>> metrics
>>>>>>>>>> is as non intrusive as possible, especially that they will need
>>> to be
>>>>>>>>>> updated once per record. (They might be collected more rarely with
>>>>> some
>>>>>>>>>> overhead, but the hot path of updating it per record will need to
>>> be
>>>>> as
>>>>>>>>>> quick as possible). That includes both avoiding heavy computation
>>> on
>>>>>>> per
>>>>>>>>>> record path: histograms?, measuring time for time based metrics
>>> (per
>>>>>>>>>> second) (System.currentTimeMillis() depending on the
>>> implementation
>>>>> can
>>>>>>>>>> invoke a system call)
>>>>>>>>>> 
>>>>>>>>>> 2. It would be nice to have metrics, that allow us to check the
>>> cause
>>>>>>> of
>>>>>>>>>> back pressure:
>>>>>>>>>> a) for sources, length of input queue (in bytes? Or boolean
>>>>>>>>>> hasSomethingl/isEmpty)
>>>>>>>>>> b) for sinks, length of output queue (in bytes? Or boolean
>>>>>>>>>> hasSomething/isEmpty)
>>>>>>>>>> 
>>>>>>>>>> a) is useful in a scenario when we are processing backlog of
>>> records,
>>>>>>> all
>>>>>>>>>> of the internal Flink’s input/output network buffers are empty,
>>> and
>>>>> we
>>>>>>> want
>>>>>>>>>> to check whether the external source system is the bottleneck
>>>>> (source’s
>>>>>>>>>> input queue will be empty), or if the Flink’s connector is the
>>>>>>> bottleneck
>>>>>>>>>> (source’s input queues will be full).
>>>>>>>>>> b) similar story. Backlog of records, but this time all of the
>>>>> internal
>>>>>>>>>> Flink’s input/ouput network buffers are full, and we want o check
>>>>>>> whether
>>>>>>>>>> the external sink system is the bottleneck (sink output queues are
>>>>>>> full),
>>>>>>>>>> or if the Flink’s connector is the bottleneck (sink’s output
>>> queues
>>>>>>> will be
>>>>>>>>>> empty)
>>>>>>>>>> 
>>>>>>>>>> It might be sometimes difficult to provide those metrics, so they
>>>>> could
>>>>>>>>>> be optional, but if we could provide them, it would be really
>>>>> helpful.
>>>>>>>>>> 
>>>>>>>>>> Piotrek
>>>>>>>>>> 
>>>>>>>>>> [1]
>>>>>>>>>> 
>>>>>>> 
>>>>> 
>>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-33:+Standardize+Connector+Metrics
>>>>>>>>>> <
>>>>>>>>>> 
>>>>>>> 
>>>>> 
>>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-33:+Standardize+Connector+Metrics
>>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>>> On 24 Apr 2019, at 13:28, Stephan Ewen <se...@apache.org> wrote:
>>>>>>>>>>> 
>>>>>>>>>>> I think this sounds reasonable.
>>>>>>>>>>> 
>>>>>>>>>>> Let's keep the "reconfiguration without stopping the job" out of
>>>>> this,
>>>>>>>>>>> because that would be a super big effort and if we approach that,
>>>>> then
>>>>>>>>>> in
>>>>>>>>>>> more generic way rather than specific to connector metrics.
>>>>>>>>>>> 
>>>>>>>>>>> I would suggest to look at the following things before starting
>>> with
>>>>>>> any
>>>>>>>>>>> implementation work:
>>>>>>>>>>> 
>>>>>>>>>>> - Try and find a committer to support this, otherwise it will be
>>>>> hard
>>>>>>>>>> to
>>>>>>>>>>> make progress
>>>>>>>>>>> - Start with defining a smaller set of "core metrics" and extend
>>> the
>>>>>>>>>> set
>>>>>>>>>>> later. I think that is easier than now blocking on reaching
>>>>> consensus
>>>>>>>>>> on a
>>>>>>>>>>> large group of metrics.
>>>>>>>>>>> - Find a solution to the problem Chesnay mentioned, that the
>>>>> "records
>>>>>>>>>> in"
>>>>>>>>>>> metric is somehow overloaded and exists already in the IO Metric
>>>>>>> group.
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> On Mon, Mar 25, 2019 at 7:16 AM Becket Qin <becket....@gmail.com
>>>> 
>>>>>>>>>> wrote:
>>>>>>>>>>> 
>>>>>>>>>>>> Hi Stephan,
>>>>>>>>>>>> 
>>>>>>>>>>>> Thanks a lot for the feedback. All makes sense.
>>>>>>>>>>>> 
>>>>>>>>>>>> It is a good suggestion to simply have an onRecord(numBytes,
>>>>>>> eventTime)
>>>>>>>>>>>> method for connector writers. It should meet most of the
>>>>>>> requirements,
>>>>>>>>>>>> individual
>>>>>>>>>>>> 
>>>>>>>>>>>> The configurable metrics feature is something really useful,
>>>>>>>>>> especially if
>>>>>>>>>>>> we can somehow make it dynamically configurable without stopping
>>>>> the
>>>>>>>>>> jobs.
>>>>>>>>>>>> It might be better to make it a separate discussion because it
>>> is a
>>>>>>>>>> more
>>>>>>>>>>>> generic feature instead of only for connectors.
>>>>>>>>>>>> 
>>>>>>>>>>>> So in order to make some progress, in this FLIP we can limit the
>>>>>>>>>> discussion
>>>>>>>>>>>> scope to the connector related items:
>>>>>>>>>>>> 
>>>>>>>>>>>> - the standard connector metric names and types.
>>>>>>>>>>>> - the abstract ConnectorMetricHandler interface
>>>>>>>>>>>> 
>>>>>>>>>>>> I'll start a separate thread to discuss other general metric
>>>>> related
>>>>>>>>>>>> enhancement items including:
>>>>>>>>>>>> 
>>>>>>>>>>>> - optional metrics
>>>>>>>>>>>> - dynamic metric configuration
>>>>>>>>>>>> - potential combination with rate limiter
>>>>>>>>>>>> 
>>>>>>>>>>>> Does this plan sound reasonable?
>>>>>>>>>>>> 
>>>>>>>>>>>> Thanks,
>>>>>>>>>>>> 
>>>>>>>>>>>> Jiangjie (Becket) Qin
>>>>>>>>>>>> 
>>>>>>>>>>>> On Sat, Mar 23, 2019 at 5:53 AM Stephan Ewen <se...@apache.org>
>>>>>>> wrote:
>>>>>>>>>>>> 
>>>>>>>>>>>>> Ignoring for a moment implementation details, this connector
>>>>> metrics
>>>>>>>>>> work
>>>>>>>>>>>>> is a really good thing to do, in my opinion
>>>>>>>>>>>>> 
>>>>>>>>>>>>> The questions "oh, my job seems to be doing nothing, I am
>>> looking
>>>>> at
>>>>>>>>>> the
>>>>>>>>>>>> UI
>>>>>>>>>>>>> and the 'records in' value is still zero" is in the top three
>>>>>>> support
>>>>>>>>>>>>> questions I have been asked personally.
>>>>>>>>>>>>> Introspection into "how far is the consumer lagging behind"
>>> (event
>>>>>>>>>> time
>>>>>>>>>>>>> fetch latency) came up many times as well.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> So big +1 to solving this problem.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> About the exact design - I would try to go for the following
>>>>>>>>>> properties:
>>>>>>>>>>>>> 
>>>>>>>>>>>>> - keep complexity of of connectors. Ideally the metrics handler
>>>>> has
>>>>>>> a
>>>>>>>>>>>>> single onRecord(numBytes, eventTime) method or so, and
>>> everything
>>>>>>>>>> else is
>>>>>>>>>>>>> internal to the handler. That makes it dead simple for the
>>>>>>> connector.
>>>>>>>>>> We
>>>>>>>>>>>>> can also think of an extensive scheme for connector specific
>>>>>>> metrics.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> - make it configurable on the job it cluster level which
>>> metrics
>>>>> the
>>>>>>>>>>>>> handler internally creates when that method is invoked.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> What do you think?
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Best,
>>>>>>>>>>>>> Stephan
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> On Thu, Mar 21, 2019 at 10:42 AM Chesnay Schepler <
>>>>>>> ches...@apache.org
>>>>>>>>>>> 
>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>> 
>>>>>>>>>>>>>> As I said before, I believe this to be over-engineered and
>>> have
>>>>> no
>>>>>>>>>>>>>> interest in this implementation.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> There are conceptual issues like defining a duplicate
>>>>>>>>>>>> numBytesIn(PerSec)
>>>>>>>>>>>>>> metric that already exists for each operator.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> On 21.03.2019 06:13, Becket Qin wrote:
>>>>>>>>>>>>>>> A few updates to the thread. I uploaded a patch[1] as a
>>> complete
>>>>>>>>>>>>>>> example of how users can use the metrics in the future.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Some thoughts below after taking a look at the
>>>>> AbstractMetricGroup
>>>>>>>>>>>> and
>>>>>>>>>>>>>>> its subclasses.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> This patch intends to provide convenience for Flink connector
>>>>>>>>>>>>>>> implementations to follow metrics standards proposed in
>>> FLIP-33.
>>>>>>> It
>>>>>>>>>>>>>>> also try to enhance the metric management in general way to
>>> help
>>>>>>>>>>>> users
>>>>>>>>>>>>>>> with:
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 1. metric definition
>>>>>>>>>>>>>>> 2. metric dependencies check
>>>>>>>>>>>>>>> 3. metric validation
>>>>>>>>>>>>>>> 4. metric control (turn on / off particular metrics)
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> This patch wraps |MetricGroup| to extend the functionality of
>>>>>>>>>>>>>>> |AbstractMetricGroup| and its subclasses. The
>>>>>>>>>>>>>>> |AbstractMetricGroup| mainly focus on the metric group
>>>>> hierarchy,
>>>>>>>>>> but
>>>>>>>>>>>>>>> does not really manage the metrics other than keeping them
>>> in a
>>>>>>> Map.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Ideally we should only have one entry point for the metrics.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Right now the entry point is |AbstractMetricGroup|. However,
>>>>>>> besides
>>>>>>>>>>>>>>> the missing functionality mentioned above,
>>> |AbstractMetricGroup|
>>>>>>>>>>>> seems
>>>>>>>>>>>>>>> deeply rooted in Flink runtime. We could extract it out to
>>>>>>>>>>>>>>> flink-metrics in order to use it for generic purpose. There
>>> will
>>>>>>> be
>>>>>>>>>>>>>>> some work, though.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Another approach is to make |AbstractMetrics| in this patch
>>> as
>>>>> the
>>>>>>>>>>>>>>> metric entry point. It wraps metric group and provides the
>>>>> missing
>>>>>>>>>>>>>>> functionalities. Then we can roll out this pattern to runtime
>>>>>>>>>>>>>>> components gradually as well.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> My first thought is that the latter approach gives a more
>>> smooth
>>>>>>>>>>>>>>> migration. But I am also OK with doing a refactoring on the
>>>>>>>>>>>>>>> |AbstractMetricGroup| family.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Jiangjie (Becket) Qin
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> [1] https://github.com/becketqin/flink/pull/1
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> On Mon, Feb 25, 2019 at 2:32 PM Becket Qin <
>>>>> becket....@gmail.com
>>>>>>>>>>>>>>> <mailto:becket....@gmail.com>> wrote:
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Hi Chesnay,
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> It might be easier to discuss some implementation details in
>>>>> the
>>>>>>>>>>>>>>> PR review instead of in the FLIP discussion thread. I have a
>>>>>>>>>>>> patch
>>>>>>>>>>>>>>> for Kafka connectors ready but haven't submitted the PR yet.
>>>>>>>>>>>>>>> Hopefully that will help explain a bit more.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> ** Re: metric type binding
>>>>>>>>>>>>>>> This is a valid point that worths discussing. If I understand
>>>>>>>>>>>>>>> correctly, there are two points:
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 1. Metric type / interface does not matter as long as the
>>>>> metric
>>>>>>>>>>>>>>> semantic is clearly defined.
>>>>>>>>>>>>>>> Conceptually speaking, I agree that as long as the metric
>>>>>>>>>>>> semantic
>>>>>>>>>>>>>>> is defined, metric type does not matter. To some extent,
>>> Gauge
>>>>> /
>>>>>>>>>>>>>>> Counter / Meter / Histogram themselves can be think of as
>>> some
>>>>>>>>>>>>>>> well-recognized semantics, if you wish. In Flink, these
>>> metric
>>>>>>>>>>>>>>> semantics have their associated interface classes. In
>>> practice,
>>>>>>>>>>>>>>> such semantic to interface binding seems necessary for
>>>>> different
>>>>>>>>>>>>>>> components to communicate.  Simply standardize the semantic
>>> of
>>>>>>>>>>>> the
>>>>>>>>>>>>>>> connector metrics seems not sufficient for people to build
>>>>>>>>>>>>>>> ecosystem on top of. At the end of the day, we still need to
>>>>>>> have
>>>>>>>>>>>>>>> some embodiment of the metric semantics that people can
>>> program
>>>>>>>>>>>>>>> against.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 2. Sometimes the same metric semantic can be exposed using
>>>>>>>>>>>>>>> different metric types / interfaces.
>>>>>>>>>>>>>>> This is a good point. Counter and Gauge-as-a-Counter are
>>> pretty
>>>>>>>>>>>>>>> much interchangeable. This is more of a trade-off between the
>>>>>>>>>>>> user
>>>>>>>>>>>>>>> experience of metric producers and consumers. The metric
>>>>>>>>>>>> producers
>>>>>>>>>>>>>>> want to use Counter or Gauge depending on whether the counter
>>>>> is
>>>>>>>>>>>>>>> already tracked in code, while ideally the metric consumers
>>>>> only
>>>>>>>>>>>>>>> want to see a single metric type for each metric. I am
>>> leaning
>>>>>>>>>>>>>>> towards to make the metric producers happy, i.e. allow Gauge
>>> /
>>>>>>>>>>>>>>> Counter metric type, and the the metric consumers handle the
>>>>>>> type
>>>>>>>>>>>>>>> variation. The reason is that in practice, there might be
>>> more
>>>>>>>>>>>>>>> connector implementations than metric reporter
>>> implementations.
>>>>>>>>>>>> We
>>>>>>>>>>>>>>> could also provide some helper method to facilitate reading
>>>>> from
>>>>>>>>>>>>>>> such variable metric type.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Just some quick replies to the comments around implementation
>>>>>>>>>>>>>> details.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>    4) single place where metrics are registered except
>>>>>>>>>>>>>>>    connector-specific
>>>>>>>>>>>>>>>    ones (which we can't really avoid).
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Register connector specific ones in a single place is
>>> actually
>>>>>>>>>>>>>>> something that I want to achieve.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>    2) I'm talking about time-series databases like
>>> Prometheus.
>>>>>>>>>>>> We
>>>>>>>>>>>>>>>    would
>>>>>>>>>>>>>>>    only have a gauge metric exposing the last
>>>>>>> fetchTime/emitTime
>>>>>>>>>>>>>>>    that is
>>>>>>>>>>>>>>>    regularly reported to the backend (Prometheus), where a
>>>>> user
>>>>>>>>>>>>>>>    could build
>>>>>>>>>>>>>>>    a histogram of his choosing when/if he wants it.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Not sure if such downsampling works. As an example, if a user
>>>>>>>>>>>>>>> complains that there are some intermittent latency spikes
>>>>> (maybe
>>>>>>>>>>>> a
>>>>>>>>>>>>>>> few records in 10 seconds) in their processing system.
>>> Having a
>>>>>>>>>>>>>>> Gauge sampling instantaneous latency seems unlikely useful.
>>>>>>>>>>>>>>> However by looking at actual 99.9 percentile latency might
>>>>> help.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Jiangjie (Becket) Qin
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> On Fri, Feb 22, 2019 at 9:30 PM Chesnay Schepler
>>>>>>>>>>>>>>> <ches...@apache.org <mailto:ches...@apache.org>> wrote:
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>    Re: over complication of implementation.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>    I think I get understand better know what you're shooting
>>>>>>>>>>>> for,
>>>>>>>>>>>>>>>    effectively something like the OperatorIOMetricGroup.
>>>>>>>>>>>>>>>    But still, re-define setupConnectorMetrics() to accept a
>>>>> set
>>>>>>>>>>>>>>>    of flags
>>>>>>>>>>>>>>>    for counters/meters(ans _possibly_ histograms) along
>>> with a
>>>>>>>>>>>>>>>    set of
>>>>>>>>>>>>>>>    well-defined Optional<Gauge<?>>, and return the group.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>    Solves all issues as far as i can tell:
>>>>>>>>>>>>>>>    1) no metrics must be created manually (except Gauges,
>>>>> which
>>>>>>>>>>>>> are
>>>>>>>>>>>>>>>    effectively just Suppliers and you can't get around
>>> this),
>>>>>>>>>>>>>>>    2) additional metrics can be registered on the returned
>>>>>>>>>>>> group,
>>>>>>>>>>>>>>>    3) see 1),
>>>>>>>>>>>>>>>    4) single place where metrics are registered except
>>>>>>>>>>>>>>>    connector-specific
>>>>>>>>>>>>>>>    ones (which we can't really avoid).
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>    Re: Histogram
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>    1) As an example, whether "numRecordsIn" is exposed as a
>>>>>>>>>>>>>>>    Counter or a
>>>>>>>>>>>>>>>    Gauge should be irrelevant. So far we're using the metric
>>>>>>>>>>>> type
>>>>>>>>>>>>>>>    that is
>>>>>>>>>>>>>>>    the most convenient at exposing a given value. If there
>>> is
>>>>>>>>>>>>>>>    some backing
>>>>>>>>>>>>>>>    data-structure that we want to expose some data from we
>>>>>>>>>>>>>>>    typically opt
>>>>>>>>>>>>>>>    for a Gauge, as otherwise we're just mucking around with
>>>>> the
>>>>>>>>>>>>>>>    Meter/Counter API to get it to match. Similarly, if we
>>> want
>>>>>>>>>>>> to
>>>>>>>>>>>>>>>    count
>>>>>>>>>>>>>>>    something but no current count exists we typically added
>>> a
>>>>>>>>>>>>>>>    Counter.
>>>>>>>>>>>>>>>    That's why attaching semantics to metric types makes
>>> little
>>>>>>>>>>>>>>>    sense (but
>>>>>>>>>>>>>>>    unfortunately several reporters already do it); for
>>>>>>>>>>>>>>>    counters/meters
>>>>>>>>>>>>>>>    certainly, but the majority of metrics are gauges.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>    2) I'm talking about time-series databases like
>>> Prometheus.
>>>>>>>>>>>> We
>>>>>>>>>>>>>>>    would
>>>>>>>>>>>>>>>    only have a gauge metric exposing the last
>>>>>>> fetchTime/emitTime
>>>>>>>>>>>>>>>    that is
>>>>>>>>>>>>>>>    regularly reported to the backend (Prometheus), where a
>>>>> user
>>>>>>>>>>>>>>>    could build
>>>>>>>>>>>>>>>    a histogram of his choosing when/if he wants it.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>    On 22.02.2019 13:57, Becket Qin wrote:
>>>>>>>>>>>>>>>> Hi Chesnay,
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Thanks for the explanation.
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> ** Re: FLIP
>>>>>>>>>>>>>>>> I might have misunderstood this, but it seems that "major
>>>>>>>>>>>>>>>    changes" are well
>>>>>>>>>>>>>>>> defined in FLIP. The full contents is following:
>>>>>>>>>>>>>>>> What is considered a "major change" that needs a FLIP?
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Any of the following should be considered a major change:
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> - Any major new feature, subsystem, or piece of
>>>>>>>>>>>>>>>    functionality
>>>>>>>>>>>>>>>> - *Any change that impacts the public interfaces of the
>>>>>>>>>>>>>>>    project*
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> What are the "public interfaces" of the project?
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> *All of the following are public interfaces *that people
>>>>>>>>>>>>>>>    build around:
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> - DataStream and DataSet API, including classes related
>>>>>>>>>>>>>>>    to that, such as
>>>>>>>>>>>>>>>> StreamExecutionEnvironment
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> - Classes marked with the @Public annotation
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> - On-disk binary formats, such as
>>>>>>>>>>>> checkpoints/savepoints
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> - User-facing scripts/command-line tools, i.e.
>>>>>>>>>>>>>>>    bin/flink, Yarn scripts,
>>>>>>>>>>>>>>>> Mesos scripts
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> - Configuration settings
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> - *Exposed monitoring information*
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> So any monitoring information change is considered as
>>>>>>>>>>>> public
>>>>>>>>>>>>>>>    interface, and
>>>>>>>>>>>>>>>> any public interface change is considered as a "major
>>>>>>>>>>>>> change".
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> ** Re: over complication of implementation.
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Although this is more of implementation details that is not
>>>>>>>>>>>>>>>    covered by the
>>>>>>>>>>>>>>>> FLIP. But it may be worth discussing.
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> First of all, I completely agree that we should use the
>>>>>>>>>>>>>>>    simplest way to
>>>>>>>>>>>>>>>> achieve our goal. To me the goal is the following:
>>>>>>>>>>>>>>>> 1. Clear connector conventions and interfaces.
>>>>>>>>>>>>>>>> 2. The easiness of creating a connector.
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Both of them are important to the prosperity of the
>>>>>>>>>>>>>>>    connector ecosystem. So
>>>>>>>>>>>>>>>> I'd rather abstract as much as possible on our side to make
>>>>>>>>>>>>>>>    the connector
>>>>>>>>>>>>>>>> developer's work lighter. Given this goal, a static util
>>>>>>>>>>>>>>>    method approach
>>>>>>>>>>>>>>>> might have a few drawbacks:
>>>>>>>>>>>>>>>> 1. Users still have to construct the metrics by themselves.
>>>>>>>>>>>>>>>    (And note that
>>>>>>>>>>>>>>>> this might be erroneous by itself. For example, a customer
>>>>>>>>>>>>>>>    wrapper around
>>>>>>>>>>>>>>>> dropwizard meter maybe used instead of MeterView).
>>>>>>>>>>>>>>>> 2. When connector specific metrics are added, it is
>>>>>>>>>>>>>>>    difficult to enforce
>>>>>>>>>>>>>>>> the scope to be the same as standard metrics.
>>>>>>>>>>>>>>>> 3. It seems that a method proliferation is inevitable if we
>>>>>>>>>>>>>>>    want to apply
>>>>>>>>>>>>>>>> sanity checks. e.g. The metric of numBytesIn was not
>>>>>>>>>>>>>>>    registered for a meter.
>>>>>>>>>>>>>>>> 4. Metrics are still defined in random places and hard to
>>>>>>>>>>>>>> track.
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> The current PR I had was inspired by the Config system in
>>>>>>>>>>>>>>>    Kafka, which I
>>>>>>>>>>>>>>>> found pretty handy. In fact it is not only used by Kafka
>>>>>>>>>>>>>>>    itself but even
>>>>>>>>>>>>>>>> some other projects that depend on Kafka. I am not saying
>>>>>>>>>>>>>>>    this approach is
>>>>>>>>>>>>>>>> perfect. But I think it worths to save the work for
>>>>>>>>>>>>>>>    connector writers and
>>>>>>>>>>>>>>>> encourage more systematic implementation. That being said,
>>>>>>>>>>>> I
>>>>>>>>>>>>>>>    am fully open
>>>>>>>>>>>>>>>> to suggestions.
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Re: Histogram
>>>>>>>>>>>>>>>> I think there are two orthogonal questions around those
>>>>>>>>>>>>>> metrics:
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> 1. Regardless of the metric type, by just looking at the
>>>>>>>>>>>>>>>    meaning of a
>>>>>>>>>>>>>>>> metric, is generic to all connectors? If the answer is yes,
>>>>>>>>>>>>>>>    we should
>>>>>>>>>>>>>>>> include the metric into the convention. No matter whether
>>>>>>>>>>>> we
>>>>>>>>>>>>>>>    include it
>>>>>>>>>>>>>>>> into the convention or not, some connector implementations
>>>>>>>>>>>>>>>    will emit such
>>>>>>>>>>>>>>>> metric. It is better to have a convention than letting each
>>>>>>>>>>>>>>>    connector do
>>>>>>>>>>>>>>>> random things.
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> 2. If a standard metric is a histogram, what should we do?
>>>>>>>>>>>>>>>> I agree that we should make it clear that using histograms
>>>>>>>>>>>>>>>    will have
>>>>>>>>>>>>>>>> performance risk. But I do see histogram is useful in some
>>>>>>>>>>>>>>>    fine-granularity
>>>>>>>>>>>>>>>> debugging where one do not have the luxury to stop the
>>>>>>>>>>>>>>>    system and inject
>>>>>>>>>>>>>>>> more inspection code. So the workaround I am thinking is to
>>>>>>>>>>>>>>>    provide some
>>>>>>>>>>>>>>>> implementation suggestions. Assume later on we have a
>>>>>>>>>>>>>>>    mechanism of
>>>>>>>>>>>>>>>> selective metrics. In the abstract metrics class we can
>>>>>>>>>>>>>>>    disable those
>>>>>>>>>>>>>>>> metrics by default individual connector writers does not
>>>>>>>>>>>>>>>    have to do
>>>>>>>>>>>>>>>> anything (this is another advantage of having an
>>>>>>>>>>>>>>>    AbstractMetrics instead of
>>>>>>>>>>>>>>>> static util methods.)
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> I am not sure I fully understand the histogram in the
>>>>>>>>>>>>>>>    backend approach. Can
>>>>>>>>>>>>>>>> you explain a bit more? Do you mean emitting the raw data,
>>>>>>>>>>>>>>>    e.g. fetchTime
>>>>>>>>>>>>>>>> and emitTime with each record and let the histogram
>>>>>>>>>>>>>>>    computation happen in
>>>>>>>>>>>>>>>> the background? Or let the processing thread putting the
>>>>>>>>>>>>>>>    values into a
>>>>>>>>>>>>>>>> queue and have a separate thread polling from the queue and
>>>>>>>>>>>>>>>    add them into
>>>>>>>>>>>>>>>> the histogram?
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Jiangjie (Becket) Qin
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> On Fri, Feb 22, 2019 at 4:34 PM Chesnay Schepler
>>>>>>>>>>>>>>>    <ches...@apache.org <mailto:ches...@apache.org>> wrote:
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> Re: Flip
>>>>>>>>>>>>>>>>> The very first line under both the main header and Purpose
>>>>>>>>>>>>>>>    section
>>>>>>>>>>>>>>>>> describe Flips as "major changes", which this isn't.
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> Re: complication
>>>>>>>>>>>>>>>>> I'm not arguing against standardization, but again an
>>>>>>>>>>>>>>>    over-complicated
>>>>>>>>>>>>>>>>> implementation when a static utility method would be
>>>>>>>>>>>>>>>    sufficient.
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> public static void setupConnectorMetrics(
>>>>>>>>>>>>>>>>> MetricGroup operatorMetricGroup,
>>>>>>>>>>>>>>>>> String connectorName,
>>>>>>>>>>>>>>>>> Optional<Gauge<Long>> numRecordsIn,
>>>>>>>>>>>>>>>>> ...)
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> This gives you all you need:
>>>>>>>>>>>>>>>>> * a well-defined set of metrics for a connector to opt-in
>>>>>>>>>>>>>>>>> * standardized naming schemes for scope and individual
>>>>>>>>>>>>> metrics
>>>>>>>>>>>>>>>>> * standardize metric types (although personally I'm not
>>>>>>>>>>>>>>>    interested in that
>>>>>>>>>>>>>>>>> since metric types should be considered syntactic sugar)
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> Re: Configurable Histogram
>>>>>>>>>>>>>>>>> If anything they _must_ be turned off by default, but the
>>>>>>>>>>>>>>>    metric system is
>>>>>>>>>>>>>>>>> already exposing so many options that I'm not too keen on
>>>>>>>>>>>>>>>    adding even more.
>>>>>>>>>>>>>>>>> You have also only addressed my first argument against
>>>>>>>>>>>>>>>    histograms
>>>>>>>>>>>>>>>>> (performance), the second one still stands (calculate
>>>>>>>>>>>>>>>    histogram in metric
>>>>>>>>>>>>>>>>> backends instead).
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> On 21.02.2019 16:27, Becket Qin wrote:
>>>>>>>>>>>>>>>>>> Hi Chesnay,
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> Thanks for the comments. I think this is worthy of a FLIP
>>>>>>>>>>>>>>>    because it is
>>>>>>>>>>>>>>>>>> public API. According to the FLIP description a FlIP is
>>>>>>>>>>>>>>>    required in case
>>>>>>>>>>>>>>>>> of:
>>>>>>>>>>>>>>>>>> - Any change that impacts the public interfaces of
>>>>>>>>>>>>>>>    the project
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> and the following entry is found in the definition of
>>>>>>>>>>>>>>>    "public interface".
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> - Exposed monitoring information
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> Metrics are critical to any production system. So a clear
>>>>>>>>>>>>>>>    metric
>>>>>>>>>>>>>>>>> definition
>>>>>>>>>>>>>>>>>> is important for any serious users. For an organization
>>>>>>>>>>>>>>>    with large Flink
>>>>>>>>>>>>>>>>>> installation, change in metrics means great amount of
>>>>>>>>>>>>>>>    work. So such
>>>>>>>>>>>>>>>>> changes
>>>>>>>>>>>>>>>>>> do need to be fully discussed and documented.
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> ** Re: Histogram.
>>>>>>>>>>>>>>>>>> We can discuss whether there is a better way to expose
>>>>>>>>>>>>>>>    metrics that are
>>>>>>>>>>>>>>>>>> suitable for histograms. My micro-benchmark on various
>>>>>>>>>>>>>>>    histogram
>>>>>>>>>>>>>>>>>> implementations also indicates that they are
>>>>>>>>>>>> significantly
>>>>>>>>>>>>>>>    slower than
>>>>>>>>>>>>>>>>>> other metric types. But I don't think that means never
>>>>>>>>>>>> use
>>>>>>>>>>>>>>>    histogram, but
>>>>>>>>>>>>>>>>>> means use it with caution. For example, we can suggest
>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>> implementations
>>>>>>>>>>>>>>>>>> to turn them off by default and only turn it on for a
>>>>>>>>>>>>>>>    small amount of
>>>>>>>>>>>>>>>>> time
>>>>>>>>>>>>>>>>>> when performing some micro-debugging.
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> ** Re: complication:
>>>>>>>>>>>>>>>>>> Connector conventions are essential for Flink ecosystem.
>>>>>>>>>>>>>>>    Flink connectors
>>>>>>>>>>>>>>>>>> pool is probably the most important part of Flink, just
>>>>>>>>>>>>>>>    like any other
>>>>>>>>>>>>>>>>> data
>>>>>>>>>>>>>>>>>> system. Clear conventions of connectors will help build
>>>>>>>>>>>>>>>    Flink ecosystem
>>>>>>>>>>>>>>>>> in
>>>>>>>>>>>>>>>>>> a more organic way.
>>>>>>>>>>>>>>>>>> Take the metrics convention as an example, imagine
>>>>>>>>>>>> someone
>>>>>>>>>>>>>>>    has developed
>>>>>>>>>>>>>>>>> a
>>>>>>>>>>>>>>>>>> Flink connector for System foo, and another developer may
>>>>>>>>>>>>>>>    have developed
>>>>>>>>>>>>>>>>> a
>>>>>>>>>>>>>>>>>> monitoring and diagnostic framework for Flink which
>>>>>>>>>>>>>>>    analyzes the Flink
>>>>>>>>>>>>>>>>> job
>>>>>>>>>>>>>>>>>> performance based on metrics. With a clear metric
>>>>>>>>>>>>>>>    convention, those two
>>>>>>>>>>>>>>>>>> projects could be developed independently. Once users put
>>>>>>>>>>>>>>>    them together,
>>>>>>>>>>>>>>>>>> it would work without additional modifications. This
>>>>>>>>>>>>>>>    cannot be easily
>>>>>>>>>>>>>>>>>> achieved by just defining a few constants.
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> ** Re: selective metrics:
>>>>>>>>>>>>>>>>>> Sure, we can discuss that in a separate thread.
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> @Dawid
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> ** Re: latency / fetchedLatency
>>>>>>>>>>>>>>>>>> The primary purpose of establish such a convention is to
>>>>>>>>>>>>>>>    help developers
>>>>>>>>>>>>>>>>>> write connectors in a more compatible way. The convention
>>>>>>>>>>>>>>>    is supposed to
>>>>>>>>>>>>>>>>> be
>>>>>>>>>>>>>>>>>> defined more proactively. So when look at the convention,
>>>>>>>>>>>>>>>    it seems more
>>>>>>>>>>>>>>>>>> important to see if the concept is applicable to
>>>>>>>>>>>>>>>    connectors in general.
>>>>>>>>>>>>>>>>> It
>>>>>>>>>>>>>>>>>> might be true so far only Kafka connector reports
>>>>>>>>>>>> latency.
>>>>>>>>>>>>>>>    But there
>>>>>>>>>>>>>>>>> might
>>>>>>>>>>>>>>>>>> be hundreds of other connector implementations in the
>>>>>>>>>>>>>>>    Flink ecosystem,
>>>>>>>>>>>>>>>>>> though not in the Flink repo, and some of them also emits
>>>>>>>>>>>>>>>    latency. I
>>>>>>>>>>>>>>>>> think
>>>>>>>>>>>>>>>>>> a lot of other sources actually also has an append
>>>>>>>>>>>>>>>    timestamp. e.g.
>>>>>>>>>>>>>>>>> database
>>>>>>>>>>>>>>>>>> bin logs and some K-V stores. So I wouldn't be surprised
>>>>>>>>>>>>>>>    if some database
>>>>>>>>>>>>>>>>>> connector can also emit latency metrics.
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> Jiangjie (Becket) Qin
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> On Thu, Feb 21, 2019 at 10:14 PM Chesnay Schepler
>>>>>>>>>>>>>>>    <ches...@apache.org <mailto:ches...@apache.org>>
>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> Regarding 2) It doesn't make sense to investigate this
>>>>>>>>>>>> as
>>>>>>>>>>>>>>>    part of this
>>>>>>>>>>>>>>>>>>> FLIP. This is something that could be of interest for
>>>>>>>>>>>> the
>>>>>>>>>>>>>>>    entire metric
>>>>>>>>>>>>>>>>>>> system, and should be designed for as such.
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> Regarding the proposal as a whole:
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> Histogram metrics shall not be added to the core of
>>>>>>>>>>>>>>>    Flink. They are
>>>>>>>>>>>>>>>>>>> significantly more expensive than other metrics, and
>>>>>>>>>>>>>>>    calculating
>>>>>>>>>>>>>>>>>>> histograms in the application is regarded as an
>>>>>>>>>>>>>>>    anti-pattern by several
>>>>>>>>>>>>>>>>>>> metric backends, who instead recommend to expose the raw
>>>>>>>>>>>>>>>    data and
>>>>>>>>>>>>>>>>>>> calculate the histogram in the backend.
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> Second, this seems overly complicated. Given that we
>>>>>>>>>>>>>>>    already established
>>>>>>>>>>>>>>>>>>> that not all connectors will export all metrics we are
>>>>>>>>>>>>>>>    effectively
>>>>>>>>>>>>>>>>>>> reducing this down to a consistent naming scheme. We
>>>>>>>>>>>>>>>    don't need anything
>>>>>>>>>>>>>>>>>>> sophisticated for that; basically just a few constants
>>>>>>>>>>>>>>>    that all
>>>>>>>>>>>>>>>>>>> connectors use.
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> I'm not convinced that this is worthy of a FLIP.
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> On 21.02.2019 14:26, Dawid Wysakowicz wrote:
>>>>>>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> Ad 1. In general I undestand and I agree. But those
>>>>>>>>>>>>>>>    particular metrics
>>>>>>>>>>>>>>>>>>>> (latency, fetchLatency), right now would only be
>>>>>>>>>>>>>>>    reported if user uses
>>>>>>>>>>>>>>>>>>>> KafkaConsumer with internal timestampAssigner with
>>>>>>>>>>>>>>>    StreamCharacteristic
>>>>>>>>>>>>>>>>>>>> set to EventTime, right? That sounds like a very
>>>>>>>>>>>>>>>    specific case. I am
>>>>>>>>>>>>>>>>> not
>>>>>>>>>>>>>>>>>>>> sure if we should introduce a generic metric that will
>>>>>>>>>>>> be
>>>>>>>>>>>>>>>>>>>> disabled/absent for most of implementations.
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> Ad.2 That sounds like an orthogonal issue, that might
>>>>>>>>>>>>>>>    make sense to
>>>>>>>>>>>>>>>>>>>> investigate in the future.
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> Dawid
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> On 21/02/2019 13:20, Becket Qin wrote:
>>>>>>>>>>>>>>>>>>>>> Hi Dawid,
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> Thanks for the feedback. That makes sense to me. There
>>>>>>>>>>>>>>>    are two cases
>>>>>>>>>>>>>>>>> to
>>>>>>>>>>>>>>>>>>> be
>>>>>>>>>>>>>>>>>>>>> addressed.
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> 1. The metrics are supposed to be a guidance. It is
>>>>>>>>>>>>>>>    likely that a
>>>>>>>>>>>>>>>>>>> connector
>>>>>>>>>>>>>>>>>>>>> only supports some but not all of the metrics. In that
>>>>>>>>>>>>>>>    case, each
>>>>>>>>>>>>>>>>>>> connector
>>>>>>>>>>>>>>>>>>>>> implementation should have the freedom to decide which
>>>>>>>>>>>>>>>    metrics are
>>>>>>>>>>>>>>>>>>>>> reported. For the metrics that are supported, the
>>>>>>>>>>>>>>>    guidance should be
>>>>>>>>>>>>>>>>>>>>> followed.
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> 2. Sometimes users may want to disable certain metrics
>>>>>>>>>>>>>>>    for some reason
>>>>>>>>>>>>>>>>>>>>> (e.g. performance / reprocessing of data). A generic
>>>>>>>>>>>>>>>    mechanism should
>>>>>>>>>>>>>>>>> be
>>>>>>>>>>>>>>>>>>>>> provided to allow user choose which metrics are
>>>>>>>>>>>>>>>    reported. This
>>>>>>>>>>>>>>>>> mechanism
>>>>>>>>>>>>>>>>>>>>> should also be honored by the connector
>>>>>>>>>>>> implementations.
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> Does this sound reasonable to you?
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> Jiangjie (Becket) Qin
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> On Thu, Feb 21, 2019 at 4:22 PM Dawid Wysakowicz <
>>>>>>>>>>>>>>>>>>> dwysakow...@apache.org <mailto:dwysakow...@apache.org>>
>>>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> Generally I like the idea of having a unified,
>>>>>>>>>>>>>>>    standard set of
>>>>>>>>>>>>>>>>> metrics
>>>>>>>>>>>>>>>>>>> for
>>>>>>>>>>>>>>>>>>>>>> all connectors. I have some slight concerns about
>>>>>>>>>>>>>>>    fetchLatency and
>>>>>>>>>>>>>>>>>>>>>> latency though. They are computed based on EventTime
>>>>>>>>>>>>>>>    which is not a
>>>>>>>>>>>>>>>>>>> purely
>>>>>>>>>>>>>>>>>>>>>> technical feature. It depends often on some business
>>>>>>>>>>>>>>>    logic, might be
>>>>>>>>>>>>>>>>>>> absent
>>>>>>>>>>>>>>>>>>>>>> or defined after source. Those metrics could also
>>>>>>>>>>>>>>>    behave in a weird
>>>>>>>>>>>>>>>>>>> way in
>>>>>>>>>>>>>>>>>>>>>> case of replaying backlog. Therefore I am not sure if
>>>>>>>>>>>>>>>    we should
>>>>>>>>>>>>>>>>> include
>>>>>>>>>>>>>>>>>>>>>> those metrics by default. Maybe we could at least
>>>>>>>>>>>>>>>    introduce a feature
>>>>>>>>>>>>>>>>>>>>>> switch for them? What do you think?
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> Dawid
>>>>>>>>>>>>>>>>>>>>>> On 21/02/2019 03:13, Becket Qin wrote:
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> Bump. If there is no objections to the proposed
>>>>>>>>>>>>>>>    metrics. I'll start a
>>>>>>>>>>>>>>>>>>>>>> voting thread later toady.
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> Jiangjie (Becket) Qin
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> On Mon, Feb 11, 2019 at 8:17 PM Becket Qin
>>>>>>>>>>>>>>>    <becket....@gmail.com <mailto:becket....@gmail.com>> <
>>>>>>>>>>>>>>>>>>> becket....@gmail.com <mailto:becket....@gmail.com>>
>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>> Hi folks,
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> I would like to start the FLIP discussion thread
>>>>>>>>>>>> about
>>>>>>>>>>>>>>>    standardize
>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>>>> connector metrics.
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> In short, we would like to provide a convention of
>>>>>>>>>>>>>>>    Flink connector
>>>>>>>>>>>>>>>>>>>>>> metrics. It will help simplify the monitoring and
>>>>>>>>>>>>>>>    alerting on Flink
>>>>>>>>>>>>>>>>>>> jobs.
>>>>>>>>>>>>>>>>>>>>>> The FLIP link is following:
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>> 
>>>>> 
>>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-33%3A+Standardize+Connector+Metrics
>>>>>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> Jiangjie (Becket) Qin
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>> 
>>>

Re: [DISCUSS] FLIP-33: Standardize connector metrics

Reply via email to