In any case, we probably have to make some change to the existing
MetricGroup interface. We can discuss that in this thread as well.

On Thu, Mar 21, 2019 at 1:13 PM Becket Qin <becket....@gmail.com> wrote:

> A few updates to the thread. I uploaded a patch[1] as a complete example
> of how users can use the metrics in the future.
>
> Some thoughts below after taking a look at the AbstractMetricGroup and its
> subclasses.
>
> This patch intends to make it convenient for Flink connector
> implementations to follow the metrics standards proposed in FLIP-33. It
> also tries to enhance metric management in a general way to help users
> with:
>
>    1. metric definition
>    2. metric dependencies check
>    3. metric validation
>    4. metric control (turn on / off particular metrics)
>
> This patch wraps MetricGroup to extend the functionality of
> AbstractMetricGroup and its subclasses. The AbstractMetricGroup mainly
> focuses on the metric group hierarchy, but does not really manage the
> metrics other than keeping them in a Map.
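A minimal, self-contained sketch of the kind of wrapper described above (all class and method names here are illustrative, not taken from the actual patch): it declares the allowed metrics up front, validates registrations against that set, and lets individual metrics be turned off.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Stand-in for Flink's Metric marker interface, so the sketch is self-contained.
interface Metric {}

// Hypothetical wrapper: declares metrics up front, validates registrations,
// and supports turning individual metrics off.
class AbstractMetrics {
    private final Set<String> defined = new HashSet<>();
    private final Set<String> disabled = new HashSet<>();
    private final Map<String, Metric> registered = new HashMap<>();

    /** Declares a metric name as part of the standard. */
    void define(String name) {
        defined.add(name);
    }

    /** Turns a particular metric off; later registrations become no-ops. */
    void disable(String name) {
        disabled.add(name);
    }

    /** Registers a metric, rejecting names outside the declared set. */
    void register(String name, Metric metric) {
        if (!defined.contains(name)) {
            throw new IllegalArgumentException("Undefined metric: " + name);
        }
        if (!disabled.contains(name)) {
            registered.put(name, metric);
        }
    }

    boolean isRegistered(String name) {
        return registered.containsKey(name);
    }
}
```

A connector would then register only metrics from the declared set, and the wrapper gives a single place to apply validation and on/off control.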
>
> Ideally we should only have one entry point for the metrics.
>
> Right now the entry point is AbstractMetricGroup. However, besides the
> missing functionality mentioned above, AbstractMetricGroup seems deeply
> rooted in the Flink runtime. We could extract it out to flink-metrics in
> order to use it for general purposes. That will take some work, though.
>
> Another approach is to make the AbstractMetrics in this patch the metric
> entry point. It wraps the metric group and provides the missing
> functionality. We can then roll this pattern out to runtime components
> gradually as well.
>
> My first thought is that the latter approach allows a smoother
> migration. But I am also OK with refactoring the
> AbstractMetricGroup family.
>
> Thanks,
>
> Jiangjie (Becket) Qin
>
> [1] https://github.com/becketqin/flink/pull/1
>
> On Mon, Feb 25, 2019 at 2:32 PM Becket Qin <becket....@gmail.com> wrote:
>
>> Hi Chesnay,
>>
>> It might be easier to discuss some implementation details in the PR
>> review instead of in the FLIP discussion thread. I have a patch for Kafka
>> connectors ready but haven't submitted the PR yet. Hopefully that will help
>> explain a bit more.
>>
>> ** Re: metric type binding
>> This is a valid point that is worth discussing. If I understand correctly,
>> there are two points:
>>
>> 1. Metric type / interface does not matter as long as the metric semantic
>> is clearly defined.
>> Conceptually speaking, I agree that as long as the metric semantic is
>> defined, the metric type does not matter. To some extent, Gauge / Counter /
>> Meter / Histogram can themselves be thought of as well-recognized
>> semantics, if you wish. In Flink, these metric semantics have their
>> associated interface classes. In practice, such semantic-to-interface
>> binding seems necessary for different components to communicate. Simply
>> standardizing the semantics of the connector metrics seems insufficient for
>> people to build an ecosystem on top of. At the end of the day, we still need
>> some embodiment of the metric semantics that people can program
>> against.
>>
>> 2. Sometimes the same metric semantic can be exposed using different
>> metric types / interfaces.
>> This is a good point. Counter and Gauge-as-a-Counter are pretty much
>> interchangeable. This is more of a trade-off between the user experience of
>> metric producers and consumers. The metric producers want to use Counter or
>> Gauge depending on whether the counter is already tracked in code, while
>> ideally the metric consumers only want to see a single metric type for each
>> metric. I am leaning towards making the metric producers happy, i.e.
>> allowing either the Gauge or Counter metric type and letting the metric
>> consumers handle the type variation. The reason is that in practice, there
>> might be more connector implementations than metric reporter
>> implementations. We could also provide some helper method to facilitate
>> reading from such a variable metric type.
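A sketch of what such a helper could look like (the helper name is hypothetical, and Counter / Gauge are reduced stand-ins for Flink's interfaces): a consumer-side method reads a count regardless of which of the two types the producer chose.

```java
// Reduced stand-ins for Flink's metric interfaces, to keep the sketch self-contained.
interface Metric {}
interface Counter extends Metric {
    long getCount();
}
interface Gauge<T> extends Metric {
    T getValue();
}

// Hypothetical consumer-side helper: reads a count whether the producer
// exposed it as a Counter or as a numeric Gauge.
final class MetricReader {
    static long readCount(Metric metric) {
        if (metric instanceof Counter) {
            return ((Counter) metric).getCount();
        }
        if (metric instanceof Gauge) {
            Object value = ((Gauge<?>) metric).getValue();
            if (value instanceof Number) {
                return ((Number) value).longValue();
            }
        }
        throw new IllegalArgumentException(
                "Metric is neither a Counter nor a numeric Gauge");
    }
}
```

With a helper like this, reporters and monitoring tools see one logical value per metric semantic, while producers keep the freedom to pick whichever type is already tracked in their code.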
>>
>>
>> Just some quick replies to the comments around implementation details.
>>
>>> 4) single place where metrics are registered except connector-specific
>>> ones (which we can't really avoid).
>>
>> Registering connector-specific ones in a single place is actually something
>> that I want to achieve.
>>
>> 2) I'm talking about time-series databases like Prometheus. We would
>>> only have a gauge metric exposing the last fetchTime/emitTime that is
>>> regularly reported to the backend (Prometheus), where a user could build
>>> a histogram of his choosing when/if he wants it.
>>>
>> I am not sure such downsampling works. As an example, suppose a user
>> complains that there are some intermittent latency spikes (maybe a few
>> records in 10 seconds) in their processing system. Having a Gauge sampling
>> instantaneous latency seems unlikely to be useful, whereas looking at the
>> actual 99.9th-percentile latency might help.
>>
>> Thanks,
>>
>> Jiangjie (Becket) Qin
>>
>>
>> On Fri, Feb 22, 2019 at 9:30 PM Chesnay Schepler <ches...@apache.org>
>> wrote:
>>
>>> Re: over complication of implementation.
>>>
>>> I think I now understand better what you're shooting for,
>>> effectively something like the OperatorIOMetricGroup.
>>> But still, re-define setupConnectorMetrics() to accept a set of flags
>>> for counters/meters (and _possibly_ histograms) along with a set of
>>> well-defined Optional<Gauge<?>>, and return the group.
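A sketch of that flag-plus-Optional-gauges variant (names, the chosen flags, and the tiny in-memory group standing in for Flink's MetricGroup are all illustrative):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Reduced stand-ins for Flink's metric interfaces.
interface Metric {}
interface Counter extends Metric {}
interface Gauge<T> extends Metric {
    T getValue();
}

// Tiny in-memory metric group, standing in for Flink's MetricGroup.
class SimpleMetricGroup {
    final Map<String, Metric> metrics = new HashMap<>();
    final Map<String, SimpleMetricGroup> groups = new HashMap<>();

    SimpleMetricGroup addGroup(String name) {
        return groups.computeIfAbsent(name, k -> new SimpleMetricGroup());
    }

    void register(String name, Metric metric) {
        metrics.put(name, metric);
    }
}

// Hypothetical setup method: a flag decides whether a counter is created,
// Optional gauges are registered only if the connector provides them, and
// the group is returned so connector-specific metrics can be added to it.
final class ConnectorMetricSetup {
    static SimpleMetricGroup setupConnectorMetrics(
            SimpleMetricGroup operatorGroup,
            String connectorName,
            boolean createNumRecordsInCounter,
            Optional<Gauge<Long>> currentFetchLatency) {
        SimpleMetricGroup group = operatorGroup.addGroup(connectorName);
        if (createNumRecordsInCounter) {
            group.register("numRecordsIn", new Counter() {});
        }
        currentFetchLatency.ifPresent(g -> group.register("currentFetchLatency", g));
        return group;
    }
}
```

The single setup call fixes the scope and names, while the returned group keeps connector-specific additions in the same place.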
>>>
>>> That solves all issues as far as I can tell:
>>> 1) no metrics must be created manually (except Gauges, which are
>>> effectively just Suppliers and you can't get around this),
>>> 2) additional metrics can be registered on the returned group,
>>> 3) see 1),
>>> 4) single place where metrics are registered except connector-specific
>>> ones (which we can't really avoid).
>>>
>>> Re: Histogram
>>>
>>> 1) As an example, whether "numRecordsIn" is exposed as a Counter or a
>>> Gauge should be irrelevant. So far we're using the metric type that is
>>> the most convenient at exposing a given value. If there is some backing
>>> data-structure that we want to expose some data from, we typically opt
>>> for a Gauge, as otherwise we're just mucking around with the
>>> Meter/Counter API to get it to match. Similarly, if we want to count
>>> something but no current count exists, we typically add a Counter.
>>> That's why attaching semantics to metric types makes little sense (but
>>> unfortunately several reporters already do it); for counters/meters
>>> certainly, but the majority of metrics are gauges.
>>>
>>> 2) I'm talking about time-series databases like Prometheus. We would
>>> only have a gauge metric exposing the last fetchTime/emitTime that is
>>> regularly reported to the backend (Prometheus), where a user could build
>>> a histogram of his choosing when/if he wants it.
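A sketch of that pattern (the class name is illustrative, and Gauge is a stand-in for Flink's interface): the connector only records the most recent latency, which is cheap on the hot path, and the time-series backend aggregates the scraped values over time.

```java
import java.util.concurrent.atomic.AtomicLong;

// Stand-in for Flink's Gauge interface.
interface Gauge<T> {
    T getValue();
}

// Hypothetical gauge exposing only the last observed fetch latency; a
// time-series backend such as Prometheus can aggregate the scraped values
// into whatever histogram the user wants, when they want it.
class LastFetchLatencyGauge implements Gauge<Long> {
    private final AtomicLong lastLatencyMillis = new AtomicLong(-1L);

    // Hot path: a single atomic write, no histogram bookkeeping.
    void update(long latencyMillis) {
        lastLatencyMillis.set(latencyMillis);
    }

    @Override
    public Long getValue() {
        return lastLatencyMillis.get();
    }
}
```

The trade-off Becket raises below still applies: a backend that only scrapes the latest value periodically can miss short-lived spikes between scrapes.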
>>>
>>> On 22.02.2019 13:57, Becket Qin wrote:
>>> > Hi Chesnay,
>>> >
>>> > Thanks for the explanation.
>>> >
>>> > ** Re: FLIP
>>> > I might have misunderstood this, but it seems that "major changes" are
>>> > well defined in the FLIP. The full content is as follows:
>>> > What is considered a "major change" that needs a FLIP?
>>> > What is considered a "major change" that needs a FLIP?
>>> >
>>> > Any of the following should be considered a major change:
>>> >
>>> >     - Any major new feature, subsystem, or piece of functionality
>>> >     - *Any change that impacts the public interfaces of the project*
>>> >
>>> > What are the "public interfaces" of the project?
>>> >
>>> >
>>> >
>>> > *All of the following are public interfaces *that people build around:
>>> >
>>> >     - DataStream and DataSet API, including classes related to that,
>>> >     such as StreamExecutionEnvironment
>>> >
>>> >
>>> >     - Classes marked with the @Public annotation
>>> >
>>> >
>>> >     - On-disk binary formats, such as checkpoints/savepoints
>>> >
>>> >
>>> >     - User-facing scripts/command-line tools, i.e. bin/flink, Yarn
>>> >     scripts, Mesos scripts
>>> >
>>> >
>>> >     - Configuration settings
>>> >
>>> >
>>> >     - *Exposed monitoring information*
>>> >
>>> >
>>> > So any monitoring information change is considered a public interface
>>> > change, and any public interface change is considered a "major change".
>>> >
>>> >
>>> > ** Re: over complication of implementation.
>>> >
>>> > This is more of an implementation detail that is not covered by the
>>> > FLIP, but it may be worth discussing.
>>> >
>>> > First of all, I completely agree that we should use the simplest way to
>>> > achieve our goal. To me the goal is the following:
>>> > 1. Clear connector conventions and interfaces.
>>> > 2. The easiness of creating a connector.
>>> >
>>> > Both of them are important to the prosperity of the connector
>>> > ecosystem. So I'd rather abstract as much as possible on our side to
>>> > make the connector developer's work lighter. Given this goal, a static
>>> > util method approach might have a few drawbacks:
>>> > 1. Users still have to construct the metrics by themselves. (And note
>>> > that this might be error-prone by itself. For example, a custom wrapper
>>> > around a Dropwizard meter may be used instead of MeterView.)
>>> > 2. When connector-specific metrics are added, it is difficult to
>>> > enforce the scope to be the same as for the standard metrics.
>>> > 3. It seems that method proliferation is inevitable if we want to
>>> > apply sanity checks, e.g. that the numBytesIn metric was not registered
>>> > as a meter.
>>> > 4. Metrics are still defined in random places and hard to track.
>>> >
>>> > The current PR I had was inspired by the Config system in Kafka, which
>>> > I found pretty handy. In fact it is not only used by Kafka itself but
>>> > even by some other projects that depend on Kafka. I am not saying this
>>> > approach is perfect, but I think it is worth it to save work for
>>> > connector writers and encourage more systematic implementations. That
>>> > being said, I am fully open to suggestions.
>>> >
>>> >
>>> > Re: Histogram
>>> > I think there are two orthogonal questions around those metrics:
>>> >
>>> > 1. Regardless of the metric type, just looking at the meaning of a
>>> > metric, is it generic to all connectors? If the answer is yes, we
>>> > should include the metric in the convention. Whether or not we include
>>> > it in the convention, some connector implementations will emit such a
>>> > metric. It is better to have a convention than to let each connector do
>>> > random things.
>>> >
>>> > 2. If a standard metric is a histogram, what should we do?
>>> > I agree that we should make it clear that using histograms carries a
>>> > performance risk. But I do see that histograms are useful in some
>>> > fine-granularity debugging where one does not have the luxury of
>>> > stopping the system and injecting more inspection code. So the
>>> > workaround I am thinking of is to provide some implementation
>>> > suggestions. Assume later on we have a mechanism for selective metrics.
>>> > In the abstract metrics class we can disable those metrics by default,
>>> > so individual connector writers do not have to do anything (this is
>>> > another advantage of having an AbstractMetrics instead of static util
>>> > methods).
>>> >
>>> > I am not sure I fully understand the histogram-in-the-backend approach.
>>> > Can you explain a bit more? Do you mean emitting the raw data, e.g.
>>> > fetchTime and emitTime, with each record and letting the histogram
>>> > computation happen in the background? Or letting the processing thread
>>> > put the values into a queue and having a separate thread poll from the
>>> > queue and add them to the histogram?
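To make the second reading concrete, here is a purely illustrative sketch of the queue-based variant (a manual drain() stands in for the polling thread so the sketch stays deterministic):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Hypothetical sketch: the processing thread only enqueues raw values, and a
// separate consumer drains the queue and feeds the histogram off the hot path.
// A manual drain() replaces the polling thread to keep things deterministic.
class OffThreadHistogram {
    private final BlockingQueue<Long> samples = new ArrayBlockingQueue<>(1024);
    private final List<Long> histogramSamples = new ArrayList<>();

    // Hot path: O(1) enqueue; the sample is dropped if the queue is full.
    void record(long value) {
        samples.offer(value);
    }

    // Consumer side: moves queued samples into the histogram's backing store.
    void drain() {
        samples.drainTo(histogramSamples);
    }

    int sampleCount() {
        return histogramSamples.size();
    }
}
```

The design choice is that the processing thread pays only for an enqueue, while the (bounded) queue absorbs bursts; under sustained overload, samples are dropped rather than blocking the hot path.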
>>> >
>>> > Thanks,
>>> >
>>> > Jiangjie (Becket) Qin
>>> >
>>> >
>>> >
>>> >
>>> >
>>> > On Fri, Feb 22, 2019 at 4:34 PM Chesnay Schepler <ches...@apache.org>
>>> wrote:
>>> >
>>> >> Re: Flip
>>> >> The very first line under both the main header and the Purpose section
>>> >> describes FLIPs as "major changes", which this isn't.
>>> >>
>>> >> Re: complication
>>> >> I'm not arguing against standardization, but against an over-complicated
>>> >> implementation when a static utility method would be sufficient.
>>> >>
>>> >> public static void setupConnectorMetrics(
>>> >>         MetricGroup operatorMetricGroup,
>>> >>         String connectorName,
>>> >>         Optional<Gauge<Long>> numRecordsIn,
>>> >>         ...)
>>> >>
>>> >> This gives you all you need:
>>> >> * a well-defined set of metrics for a connector to opt in to
>>> >> * standardized naming schemes for scope and individual metrics
>>> >> * standardized metric types (although personally I'm not interested in
>>> >> that, since metric types should be considered syntactic sugar)
>>> >>
>>> >> Re: Configurable Histogram
>>> >> If anything, they _must_ be turned off by default, but the metric
>>> >> system already exposes so many options that I'm not too keen on adding
>>> >> even more. You have also only addressed my first argument against
>>> >> histograms (performance); the second one still stands (calculate
>>> >> histograms in metric backends instead).
>>> >>
>>> >> On 21.02.2019 16:27, Becket Qin wrote:
>>> >>> Hi Chesnay,
>>> >>>
>>> >>> Thanks for the comments. I think this is worthy of a FLIP because it
>>> >>> is a public API. According to the FLIP description, a FLIP is required
>>> >>> in case of:
>>> >>>      - Any change that impacts the public interfaces of the project
>>> >>>
>>> >>> and the following entry is found in the definition of "public
>>> interface".
>>> >>>
>>> >>>      - Exposed monitoring information
>>> >>>
>>> >>> Metrics are critical to any production system, so a clear metric
>>> >>> definition is important for any serious user. For an organization with
>>> >>> a large Flink installation, a change in metrics means a great amount
>>> >>> of work. So such changes do need to be fully discussed and documented.
>>> >>>
>>> >>> ** Re: Histogram.
>>> >>> We can discuss whether there is a better way to expose metrics that
>>> >>> are suitable for histograms. My micro-benchmark on various histogram
>>> >>> implementations also indicates that they are significantly slower than
>>> >>> other metric types. But I don't think that means we should never use
>>> >>> histograms; it means we should use them with caution. For example, we
>>> >>> can suggest that implementations turn them off by default and only
>>> >>> turn them on for a short amount of time when performing some
>>> >>> micro-debugging.
>>> >>>
>>> >>> ** Re: complication:
>>> >>> Connector conventions are essential for the Flink ecosystem. The
>>> >>> connector pool is probably the most important part of Flink, just like
>>> >>> in any other data system. Clear connector conventions will help build
>>> >>> the Flink ecosystem in a more organic way.
>>> >>> Take the metrics convention as an example: imagine someone has
>>> >>> developed a Flink connector for system foo, and another developer has
>>> >>> developed a monitoring and diagnostics framework for Flink which
>>> >>> analyzes Flink job performance based on metrics. With a clear metric
>>> >>> convention, those two projects could be developed independently. Once
>>> >>> users put them together, they would work without additional
>>> >>> modifications. This cannot be easily achieved by just defining a few
>>> >>> constants.
>>> >>>
>>> >>> ** Re: selective metrics:
>>> >>> Sure, we can discuss that in a separate thread.
>>> >>>
>>> >>> @Dawid
>>> >>>
>>> >>> ** Re: latency / fetchedLatency
>>> >>> The primary purpose of establishing such a convention is to help
>>> >>> developers write connectors in a more compatible way. The convention
>>> >>> is supposed to be defined more proactively. So when looking at the
>>> >>> convention, it seems more important to see whether the concept is
>>> >>> applicable to connectors in general. It might be true that so far only
>>> >>> the Kafka connector reports latency. But there might be hundreds of
>>> >>> other connector implementations in the Flink ecosystem, though not in
>>> >>> the Flink repo, and some of them also emit latency. I think a lot of
>>> >>> other sources actually also have an append timestamp, e.g. database
>>> >>> binlogs and some K-V stores. So I wouldn't be surprised if some
>>> >>> database connector can also emit latency metrics.
>>> >>>
>>> >>> Thanks,
>>> >>>
>>> >>> Jiangjie (Becket) Qin
>>> >>>
>>> >>>
>>> On Thu, Feb 21, 2019 at 10:14 PM Chesnay Schepler <ches...@apache.org>
>>> wrote:
>>> >>>
>>> >>>> Regarding 2): It doesn't make sense to investigate this as part of
>>> >>>> this FLIP. This is something that could be of interest for the entire
>>> >>>> metric system, and should be designed as such.
>>> >>>>
>>> >>>> Regarding the proposal as a whole:
>>> >>>>
>>> >>>> Histogram metrics shall not be added to the core of Flink. They are
>>> >>>> significantly more expensive than other metrics, and calculating
>>> >>>> histograms in the application is regarded as an anti-pattern by
>>> >>>> several metric backends, which instead recommend exposing the raw
>>> >>>> data and calculating the histogram in the backend.
>>> >>>>
>>> >>>> Second, this seems overly complicated. Given that we already
>>> >>>> established that not all connectors will export all metrics, we are
>>> >>>> effectively reducing this down to a consistent naming scheme. We
>>> >>>> don't need anything sophisticated for that; basically just a few
>>> >>>> constants that all connectors use.
>>> >>>>
>>> >>>> I'm not convinced that this is worthy of a FLIP.
>>> >>>>
>>> >>>> On 21.02.2019 14:26, Dawid Wysakowicz wrote:
>>> >>>>> Hi,
>>> >>>>>
>>> >>>>> Ad 1. In general I understand and I agree. But those particular
>>> >>>>> metrics (latency, fetchLatency), right now, would only be reported
>>> >>>>> if the user uses the KafkaConsumer with an internal
>>> >>>>> timestampAssigner with StreamCharacteristic set to EventTime, right?
>>> >>>>> That sounds like a very specific case. I am not sure if we should
>>> >>>>> introduce a generic metric that will be disabled/absent for most
>>> >>>>> implementations.
>>> >>>>>
>>> >>>>> Ad 2. That sounds like an orthogonal issue that might make sense to
>>> >>>>> investigate in the future.
>>> >>>>>
>>> >>>>> Best,
>>> >>>>>
>>> >>>>> Dawid
>>> >>>>>
>>> >>>>> On 21/02/2019 13:20, Becket Qin wrote:
>>> >>>>>> Hi Dawid,
>>> >>>>>>
>>> >>>>>> Thanks for the feedback. That makes sense to me. There are two
>>> >>>>>> cases to be addressed.
>>> >>>>>>
>>> >>>>>> 1. The metrics are supposed to be guidance. It is likely that a
>>> >>>>>> connector only supports some but not all of the metrics. In that
>>> >>>>>> case, each connector implementation should have the freedom to
>>> >>>>>> decide which metrics are reported. For the metrics that are
>>> >>>>>> supported, the guidance should be followed.
>>> >>>>>>
>>> >>>>>> 2. Sometimes users may want to disable certain metrics for some
>>> >>>>>> reason (e.g. performance / reprocessing of data). A generic
>>> >>>>>> mechanism should be provided to allow users to choose which metrics
>>> >>>>>> are reported. This mechanism should also be honored by the
>>> >>>>>> connector implementations.
>>> >>>>>>
>>> >>>>>> Does this sound reasonable to you?
>>> >>>>>>
>>> >>>>>> Thanks,
>>> >>>>>>
>>> >>>>>> Jiangjie (Becket) Qin
>>> >>>>>>
>>> >>>>>>
>>> >>>>>>
>>> >>>>>> On Thu, Feb 21, 2019 at 4:22 PM Dawid Wysakowicz <
>>> >>>>>> dwysakow...@apache.org> wrote:
>>> >>>>>>
>>> >>>>>>> Hi,
>>> >>>>>>>
>>> >>>>>>> Generally I like the idea of having a unified, standard set of
>>> >>>>>>> metrics for all connectors. I have some slight concerns about
>>> >>>>>>> fetchLatency and latency, though. They are computed based on
>>> >>>>>>> EventTime, which is not a purely technical feature. It often
>>> >>>>>>> depends on some business logic, and might be absent or defined
>>> >>>>>>> after the source. Those metrics could also behave in a weird way
>>> >>>>>>> in the case of replaying a backlog. Therefore I am not sure if we
>>> >>>>>>> should include those metrics by default. Maybe we could at least
>>> >>>>>>> introduce a feature switch for them? What do you think?
>>> >>>>>>>
>>> >>>>>>> Best,
>>> >>>>>>>
>>> >>>>>>> Dawid
>>> >>>>>>> On 21/02/2019 03:13, Becket Qin wrote:
>>> >>>>>>>
>>> >>>>>>> Bump. If there are no objections to the proposed metrics, I'll
>>> >>>>>>> start a voting thread later today.
>>> >>>>>>>
>>> >>>>>>> Thanks,
>>> >>>>>>>
>>> >>>>>>> Jiangjie (Becket) Qin
>>> >>>>>>>
>>> >>>>>>> On Mon, Feb 11, 2019 at 8:17 PM Becket Qin <becket....@gmail.com>
>>> >>>>>>> wrote:
>>> >>>>>>> Hi folks,
>>> >>>>>>>
>>> >>>>>>> I would like to start the FLIP discussion thread about
>>> >>>>>>> standardizing the connector metrics.
>>> >>>>>>>
>>> >>>>>>> In short, we would like to provide a convention for Flink
>>> >>>>>>> connector metrics. It will help simplify monitoring and alerting
>>> >>>>>>> on Flink jobs. The FLIP link is as follows:
>>> >>>>>>>
>>> >>>>>>>
>>> >>
>>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-33%3A+Standardize+Connector+Metrics
>>> >>>>>>> Thanks,
>>> >>>>>>>
>>> >>>>>>> Jiangjie (Becket) Qin
>>> >>>>>>>
>>> >>>>>>>
>>> >>>>>>>
>>> >>
>>>
>>>
