A few updates to the thread. I uploaded a patch[1] as a complete
example of how users can use the metrics in the future.
Some thoughts below after taking a look at the AbstractMetricGroup and
its subclasses.
This patch intends to provide convenience for Flink connector
implementations to follow the metrics standards proposed in FLIP-33. It
also tries to enhance metric management in a general way to help users
with:
1. metric definition
2. metric dependencies check
3. metric validation
4. metric control (turn on / off particular metrics)
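To make these four points concrete, here is a rough, hypothetical sketch of what such a declarative metric-definition facility could look like. All names below are made up for illustration, loosely in the spirit of Kafka's ConfigDef rather than any existing Flink API:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: each metric declares its name, type, dependencies,
// and default on/off state, so definition, dependency checks, validation,
// and metric control all live in one place.
class MetricDef {
    enum Type { COUNTER, METER, GAUGE, HISTOGRAM }

    static final class Spec {
        final String name;
        final Type type;
        final Set<String> dependencies;
        final boolean enabledByDefault;

        Spec(String name, Type type, Set<String> dependencies, boolean enabledByDefault) {
            this.name = name;
            this.type = type;
            this.dependencies = dependencies;
            this.enabledByDefault = enabledByDefault;
        }
    }

    private final Map<String, Spec> specs = new LinkedHashMap<>();

    MetricDef define(String name, Type type, boolean enabledByDefault, String... deps) {
        specs.put(name, new Spec(name, type, new HashSet<>(Arrays.asList(deps)), enabledByDefault));
        return this;
    }

    // Dependency check: every declared dependency must itself be defined.
    List<String> missingDependencies() {
        List<String> missing = new ArrayList<>();
        for (Spec s : specs.values()) {
            for (String dep : s.dependencies) {
                if (!specs.containsKey(dep)) {
                    missing.add(s.name + " -> " + dep);
                }
            }
        }
        return missing;
    }
}
```

A connector would then declare its standard metrics once, e.g. `new MetricDef().define("numRecordsIn", MetricDef.Type.COUNTER, true)`, and the framework could validate the whole definition before registering anything.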
This patch wraps |MetricGroup| to extend the functionality of
|AbstractMetricGroup| and its subclasses. |AbstractMetricGroup| mainly
focuses on the metric group hierarchy, but does not really manage the
metrics other than keeping them in a Map.
Ideally we should only have one entry point for the metrics. Right now
that entry point is |AbstractMetricGroup|. However, besides the missing
functionality mentioned above, |AbstractMetricGroup| seems deeply rooted
in the Flink runtime. We could extract it out to flink-metrics in order
to use it for generic purposes. That will take some work, though.
Another approach is to make the |AbstractMetrics| class in this patch
the metric entry point. It wraps a metric group and provides the missing
functionalities. We can then roll this pattern out to runtime components
gradually as well.
My first thought is that the latter approach gives a smoother migration,
but I am also OK with refactoring the |AbstractMetricGroup| family.
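To make the second approach concrete, here is a minimal, self-contained sketch of what such an |AbstractMetrics|-style wrapper could look like. The Gauge/Counter interfaces below are stand-ins for org.apache.flink.metrics.*, and all method names are hypothetical:

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.function.Supplier;

// Minimal stand-ins for Flink's metric interfaces, only to keep this
// sketch self-contained; real code would use org.apache.flink.metrics.*.
interface Gauge<T> { T getValue(); }

interface Counter { void inc(); long getCount(); }

// Hypothetical sketch: a wrapper that owns metric registration, so
// metrics are defined in one place, can be validated, and can be
// switched off individually without touching connector code.
class AbstractMetrics {
    private final Map<String, Object> metrics = new HashMap<>();
    private final Set<String> disabled = new HashSet<>();

    void disable(String name) { disabled.add(name); }

    Counter counter(String name) {
        if (disabled.contains(name)) {
            // Hand back a no-op counter so callers need no special casing.
            return new Counter() {
                public void inc() {}
                public long getCount() { return 0L; }
            };
        }
        Counter counter = new Counter() {
            private long count;
            public void inc() { count++; }
            public long getCount() { return count; }
        };
        metrics.put(name, counter);
        return counter;
    }

    <T> void gauge(String name, Supplier<T> supplier) {
        if (!disabled.contains(name)) {
            Gauge<T> gauge = supplier::get;
            metrics.put(name, gauge);
        }
    }

    Set<String> registeredNames() { return metrics.keySet(); }
}
```

The point of the sketch is the ownership model: connector code asks the wrapper for metrics by name, and the wrapper decides whether anything is actually registered.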
Thanks,
Jiangjie (Becket) Qin
[1] https://github.com/becketqin/flink/pull/1
On Mon, Feb 25, 2019 at 2:32 PM Becket Qin <becket....@gmail.com> wrote:
Hi Chesnay,
It might be easier to discuss some implementation details in the
PR review instead of in the FLIP discussion thread. I have a patch
for Kafka connectors ready but haven't submitted the PR yet.
Hopefully that will help explain a bit more.
** Re: metric type binding
This is a valid point that is worth discussing. If I understand
correctly, there are two points:
1. Metric type / interface does not matter as long as the metric
semantic is clearly defined.
Conceptually speaking, I agree that as long as the metric semantics are
defined, the metric type does not matter. To some extent, Gauge /
Counter / Meter / Histogram can themselves be thought of as
well-recognized semantics, if you wish. In Flink, these metric semantics
have their associated interface classes. In practice, such
semantic-to-interface binding seems necessary for different components
to communicate. Simply standardizing the semantics of the connector
metrics seems insufficient for people to build an ecosystem on top of.
At the end of the day, we still need some embodiment of the metric
semantics that people can program against.
2. Sometimes the same metric semantic can be exposed using
different metric types / interfaces.
This is a good point. Counter and Gauge-as-a-Counter are pretty much
interchangeable. This is more of a trade-off between the user experience
of metric producers and that of metric consumers. The metric producers
want to use a Counter or a Gauge depending on whether the count is
already tracked in code, while ideally the metric consumers only want to
see a single metric type for each metric. I am leaning towards making
the metric producers happy, i.e. allowing either the Gauge or the
Counter metric type and letting the metric consumers handle the type
variation. The reason is that in practice there might be more connector
implementations than metric reporter implementations. We could also
provide some helper method to facilitate reading from such a variable
metric type.
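Such a helper might look like the following sketch. This is not an existing Flink API, just an illustration of the idea; the Gauge/Counter interfaces are minimal stand-ins for org.apache.flink.metrics.*:

```java
// Minimal stand-ins for Flink's metric interfaces, only to keep this
// sketch self-contained; real code would use org.apache.flink.metrics.*.
interface Gauge<T> { T getValue(); }

interface Counter { long getCount(); }

// Hypothetical helper for metric consumers: read a count regardless of
// whether the producer registered a Counter or a numeric Gauge, so the
// consumer does not have to care which type the connector chose.
class MetricReaders {
    static long readCount(Object metric) {
        if (metric instanceof Counter) {
            return ((Counter) metric).getCount();
        }
        if (metric instanceof Gauge) {
            Object value = ((Gauge<?>) metric).getValue();
            if (value instanceof Number) {
                return ((Number) value).longValue();
            }
        }
        throw new IllegalArgumentException("Not a counter-like metric: " + metric);
    }
}
```

With something like this, a reporter can treat Counter and Gauge-as-a-Counter uniformly, which is the type variation handling mentioned above.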
Just some quick replies to the comments around implementation details.
4) single place where metrics are registered except connector-specific
ones (which we can't really avoid).
Registering the connector-specific ones in a single place is actually
something that I want to achieve.
2) I'm talking about time-series databases like Prometheus. We would
only have a gauge metric exposing the last fetchTime/emitTime that is
regularly reported to the backend (Prometheus), where a user could build
a histogram of his choosing when/if he wants it.
Not sure if such downsampling works. As an example, suppose a user
complains about some intermittent latency spikes (maybe a few records in
10 seconds) in their processing system. A Gauge sampling the
instantaneous latency is unlikely to be useful there, whereas looking at
the actual 99.9th-percentile latency might help.
Thanks,
Jiangjie (Becket) Qin
On Fri, Feb 22, 2019 at 9:30 PM Chesnay Schepler <ches...@apache.org> wrote:
Re: over complication of implementation.
I think I now understand better what you're shooting for: effectively
something like the OperatorIOMetricGroup.
But still: re-define setupConnectorMetrics() to accept a set of flags
for counters/meters (and _possibly_ histograms) along with a set of
well-defined Optional<Gauge<?>>, and return the group.
Solves all issues as far as I can tell:
1) no metrics must be created manually (except Gauges, which are
effectively just Suppliers and you can't get around this),
2) additional metrics can be registered on the returned group,
3) see 1),
4) single place where metrics are registered except connector-specific
ones (which we can't really avoid).
Re: Histogram
1) As an example, whether "numRecordsIn" is exposed as a Counter or a
Gauge should be irrelevant. So far we're using the metric type that is
most convenient at exposing a given value. If there is some backing
data-structure that we want to expose data from, we typically opt for a
Gauge, as otherwise we're just mucking around with the Meter/Counter API
to get it to match. Similarly, if we want to count something but no
current count exists, we typically add a Counter. That's why attaching
semantics to metric types makes little sense (but unfortunately several
reporters already do it); for counters/meters certainly, but the
majority of metrics are gauges.
2) I'm talking about time-series databases like Prometheus. We would
only have a gauge metric exposing the last fetchTime/emitTime that is
regularly reported to the backend (Prometheus), where a user could build
a histogram of his choosing when/if he wants it.
On 22.02.2019 13:57, Becket Qin wrote:
> Hi Chesnay,
>
> Thanks for the explanation.
>
> ** Re: FLIP
> I might have misunderstood this, but it seems that "major changes"
> are well defined in the FLIP. The full content is as follows:
> What is considered a "major change" that needs a FLIP?
>
> Any of the following should be considered a major change:
>
> - Any major new feature, subsystem, or piece of functionality
> - *Any change that impacts the public interfaces of the project*
>
> What are the "public interfaces" of the project?
>
>
>
> *All of the following are public interfaces* that people build around:
>
> - DataStream and DataSet API, including classes related to that,
> such as StreamExecutionEnvironment
>
>
> - Classes marked with the @Public annotation
>
>
> - On-disk binary formats, such as checkpoints/savepoints
>
>
> - User-facing scripts/command-line tools, i.e. bin/flink, Yarn
> scripts, Mesos scripts
>
>
> - Configuration settings
>
>
> - *Exposed monitoring information*
>
>
> So any monitoring information change is considered a public interface
> change, and any public interface change is considered a "major change".
>
>
> ** Re: over complication of implementation.
>
> Although this is more of an implementation detail not covered by the
> FLIP, it may be worth discussing.
>
> First of all, I completely agree that we should use the simplest way
> to achieve our goal. To me the goal is the following:
> 1. Clear connector conventions and interfaces.
> 2. The ease of creating a connector.
>
> Both of them are important to the prosperity of the connector
> ecosystem, so I'd rather abstract as much as possible on our side to
> make the connector developer's work lighter. Given this goal, a static
> util method approach might have a few drawbacks:
> 1. Users still have to construct the metrics by themselves. (And note
> that this can be error-prone by itself; for example, a custom wrapper
> around the Dropwizard meter may be used instead of MeterView.)
> 2. When connector-specific metrics are added, it is difficult to
> enforce the scope to be the same as for the standard metrics.
> 3. Method proliferation seems inevitable if we want to apply sanity
> checks, e.g. detecting that the numBytesIn metric was not registered
> as a meter.
> 4. Metrics are still defined in random places and hard to track.
>
> The current PR I had was inspired by the Config system in Kafka,
> which I found pretty handy. In fact it is used not only by Kafka
> itself but also by some other projects that depend on Kafka. I am not
> saying this approach is perfect, but I think it is worth saving the
> work for connector writers and encouraging more systematic
> implementations. That being said, I am fully open to suggestions.
>
>
> Re: Histogram
> I think there are two orthogonal questions around those metrics:
>
> 1. Regardless of the metric type, by just looking at the meaning of a
> metric, is it generic to all connectors? If the answer is yes, we
> should include the metric in the convention. Whether we include it in
> the convention or not, some connector implementations will emit such a
> metric. It is better to have a convention than to let each connector
> do random things.
>
> 2. If a standard metric is a histogram, what should we do?
> I agree that we should make it clear that using histograms carries a
> performance risk. But I do see that histograms are useful in some
> fine-granularity debugging where one does not have the luxury to stop
> the system and inject more inspection code. So the workaround I am
> thinking of is to provide some implementation suggestions. Assume
> later on we have a mechanism for selective metrics; in the abstract
> metrics class we can then disable those metrics by default, so that
> individual connector writers do not have to do anything (this is
> another advantage of having an AbstractMetrics instead of static util
> methods).
>
> I am not sure I fully understand the histogram-in-the-backend
> approach. Can you explain a bit more? Do you mean emitting the raw
> data, e.g. fetchTime and emitTime, with each record and letting the
> histogram computation happen in the backend? Or letting the processing
> thread put the values into a queue and having a separate thread poll
> from the queue and add them into the histogram?
>
> Thanks,
>
> Jiangjie (Becket) Qin
>
>
>
>
>
> On Fri, Feb 22, 2019 at 4:34 PM Chesnay Schepler
> <ches...@apache.org> wrote:
>
>> Re: FLIP
>> The very first line under both the main header and the Purpose
>> section describes FLIPs as "major changes", which this isn't.
>>
>> Re: complication
>> I'm not arguing against standardization, but again against an
>> over-complicated implementation when a static utility method would be
>> sufficient.
>>
>> public static void setupConnectorMetrics(
>>         MetricGroup operatorMetricGroup,
>>         String connectorName,
>>         Optional<Gauge<Long>> numRecordsIn,
>>         ...)
>>
>> This gives you all you need:
>> * a well-defined set of metrics for a connector to opt into
>> * standardized naming schemes for the scope and individual metrics
>> * standardized metric types (although personally I'm not interested
>> in that, since metric types should be considered syntactic sugar)
>>
>> Re: Configurable Histogram
>> If anything they _must_ be turned off by default, but the metric
>> system is already exposing so many options that I'm not too keen on
>> adding even more. You have also only addressed my first argument
>> against histograms (performance); the second one still stands
>> (calculate the histogram in the metric backends instead).
>>
>> On 21.02.2019 16:27, Becket Qin wrote:
>>> Hi Chesnay,
>>>
>>> Thanks for the comments. I think this is worthy of a FLIP because
>>> it is public API. According to the FLIP description, a FLIP is
>>> required in case of:
>>> - Any change that impacts the public interfaces of the project
>>>
>>> and the following entry is found in the definition of "public
>>> interface".
>>>
>>> - Exposed monitoring information
>>>
>>> Metrics are critical to any production system, so a clear metric
>>> definition is important for any serious user. For an organization
>>> with a large Flink installation, a change in metrics means a great
>>> amount of work. So such changes do need to be fully discussed and
>>> documented.
>>>
>>> ** Re: Histogram.
>>> We can discuss whether there is a better way to expose metrics that
>>> are suitable for histograms. My micro-benchmark on various histogram
>>> implementations also indicates that they are significantly slower
>>> than other metric types. But I don't think that means we should
>>> never use histograms; it means we should use them with caution. For
>>> example, we can suggest that implementations turn them off by
>>> default and only turn them on for a small amount of time when
>>> performing some micro-debugging.
>>>
>>> ** Re: complication:
>>> Connector conventions are essential for the Flink ecosystem. The
>>> connector pool is probably the most important part of Flink, just
>>> like in any other data system. Clear connector conventions will help
>>> build the Flink ecosystem in a more organic way.
>>> Take the metrics convention as an example: imagine someone has
>>> developed a Flink connector for system Foo, and another developer
>>> has developed a monitoring and diagnostic framework for Flink which
>>> analyzes Flink job performance based on metrics. With a clear metric
>>> convention, those two projects can be developed independently. Once
>>> users put them together, it will work without additional
>>> modifications. This cannot be easily achieved by just defining a few
>>> constants.
>>>
>>> ** Re: selective metrics:
>>> Sure, we can discuss that in a separate thread.
>>>
>>> @Dawid
>>>
>>> ** Re: latency / fetchedLatency
>>> The primary purpose of establishing such a convention is to help
>>> developers write connectors in a more compatible way. The convention
>>> is supposed to be defined more proactively. So when looking at the
>>> convention, it seems more important to see whether the concept is
>>> applicable to connectors in general. It might be true that so far
>>> only the Kafka connector reports latency. But there might be
>>> hundreds of other connector implementations in the Flink ecosystem,
>>> though not in the Flink repo, and some of them also emit latency. I
>>> think a lot of other sources actually also have an append timestamp,
>>> e.g. database binlogs and some K-V stores. So I wouldn't be
>>> surprised if some database connector can also emit latency metrics.
>>>
>>> Thanks,
>>>
>>> Jiangjie (Becket) Qin
>>>
>>>
>>> On Thu, Feb 21, 2019 at 10:14 PM Chesnay Schepler
>>> <ches...@apache.org> wrote:
>>>
>>>> Regarding 2): It doesn't make sense to investigate this as part of
>>>> this FLIP. This is something that could be of interest for the
>>>> entire metric system, and should be designed as such.
>>>>
>>>> Regarding the proposal as a whole:
>>>>
>>>> Histogram metrics shall not be added to the core of Flink. They
>>>> are significantly more expensive than other metrics, and
>>>> calculating histograms in the application is regarded as an
>>>> anti-pattern by several metric backends, which instead recommend
>>>> exposing the raw data and calculating the histogram in the backend.
>>>>
>>>> Second, this seems overly complicated. Given that we already
>>>> established that not all connectors will export all metrics, we are
>>>> effectively reducing this down to a consistent naming scheme. We
>>>> don't need anything sophisticated for that; basically just a few
>>>> constants that all connectors use.
>>>>
>>>> I'm not convinced that this is worthy of a FLIP.
>>>>
>>>> On 21.02.2019 14:26, Dawid Wysakowicz wrote:
>>>>> Hi,
>>>>>
>>>>> Ad 1. In general I understand and I agree. But those particular
>>>>> metrics (latency, fetchLatency) would right now only be reported
>>>>> if the user uses the KafkaConsumer with the internal
>>>>> timestampAssigner and StreamCharacteristic set to EventTime,
>>>>> right? That sounds like a very specific case. I am not sure we
>>>>> should introduce a generic metric that will be disabled/absent for
>>>>> most implementations.
>>>>>
>>>>> Ad 2. That sounds like an orthogonal issue that might make sense
>>>>> to investigate in the future.
>>>>>
>>>>> Best,
>>>>>
>>>>> Dawid
>>>>>
>>>>> On 21/02/2019 13:20, Becket Qin wrote:
>>>>>> Hi Dawid,
>>>>>>
>>>>>> Thanks for the feedback. That makes sense to me. There are two
>>>>>> cases to be addressed.
>>>>>>
>>>>>> 1. The metrics are supposed to be a guidance. It is likely that
>>>>>> a connector only supports some but not all of the metrics. In
>>>>>> that case, each connector implementation should have the freedom
>>>>>> to decide which metrics are reported. For the metrics that are
>>>>>> supported, the guidance should be followed.
>>>>>>
>>>>>> 2. Sometimes users may want to disable certain metrics for some
>>>>>> reason (e.g. performance / reprocessing of data). A generic
>>>>>> mechanism should be provided to allow users to choose which
>>>>>> metrics are reported. This mechanism should also be honored by
>>>>>> the connector implementations.
>>>>>>
>>>>>> Does this sound reasonable to you?
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> Jiangjie (Becket) Qin
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Thu, Feb 21, 2019 at 4:22 PM Dawid Wysakowicz
>>>>>> <dwysakow...@apache.org> wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> Generally I like the idea of having a unified, standard set of
>>>>>>> metrics for all connectors. I have some slight concerns about
>>>>>>> fetchLatency and latency though. They are computed based on
>>>>>>> EventTime, which is not a purely technical feature. It often
>>>>>>> depends on some business logic, and might be absent or defined
>>>>>>> after the source. Those metrics could also behave in a weird way
>>>>>>> when replaying a backlog. Therefore I am not sure if we should
>>>>>>> include those metrics by default. Maybe we could at least
>>>>>>> introduce a feature switch for them? What do you think?
>>>>>>>
>>>>>>> Best,
>>>>>>>
>>>>>>> Dawid
>>>>>>> On 21/02/2019 03:13, Becket Qin wrote:
>>>>>>>
>>>>>>> Bump. If there are no objections to the proposed metrics, I'll
>>>>>>> start a voting thread later today.
>>>>>>>
>>>>>>> Thanks,
>>>>>>>
>>>>>>> Jiangjie (Becket) Qin
>>>>>>>
>>>>>>> On Mon, Feb 11, 2019 at 8:17 PM Becket Qin
>>>>>>> <becket....@gmail.com> wrote:
>>>>>>> Hi folks,
>>>>>>>
>>>>>>> I would like to start the FLIP discussion thread about
>>>>>>> standardizing the connector metrics.
>>>>>>>
>>>>>>> In short, we would like to provide a convention for Flink
>>>>>>> connector metrics. It will help simplify the monitoring of and
>>>>>>> alerting on Flink jobs. The FLIP link is:
>>>>>>>
>>>>>>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-33%3A+Standardize+Connector+Metrics
>>>>>>> Thanks,
>>>>>>>
>>>>>>> Jiangjie (Becket) Qin
>>>>>>>
>>>>>>>
>>>>>>>