In any case, we probably have to make some changes to the existing MetricGroup interface. We can discuss those in this thread as well.
On Thu, Mar 21, 2019 at 1:13 PM Becket Qin <becket....@gmail.com> wrote:

> A few updates to the thread. I uploaded a patch[1] as a complete example of how users can use the metrics in the future.
>
> Some thoughts below after taking a look at the AbstractMetricGroup and its subclasses.
>
> This patch intends to make it convenient for Flink connector implementations to follow the metrics standards proposed in FLIP-33. It also tries to enhance metric management in a general way to help users with:
>
> 1. metric definition
> 2. metric dependency checks
> 3. metric validation
> 4. metric control (turning particular metrics on / off)
>
> This patch wraps MetricGroup to extend the functionality of AbstractMetricGroup and its subclasses. The AbstractMetricGroup family mainly focuses on the metric group hierarchy, but does not really manage the metrics other than keeping them in a Map.
>
> Ideally we should only have one entry point for the metrics.
>
> Right now the entry point is AbstractMetricGroup. However, besides the missing functionality mentioned above, AbstractMetricGroup seems deeply rooted in the Flink runtime. We could extract it out to flink-metrics in order to use it for generic purposes. There will be some work involved, though.
>
> Another approach is to make the AbstractMetrics in this patch the metric entry point. It wraps a metric group and provides the missing functionality. Then we can roll this pattern out to runtime components gradually as well.
>
> My first thought is that the latter approach gives a smoother migration. But I am also OK with refactoring the AbstractMetricGroup family.
>
> Thanks,
>
> Jiangjie (Becket) Qin
>
> [1] https://github.com/becketqin/flink/pull/1
>
> On Mon, Feb 25, 2019 at 2:32 PM Becket Qin <becket....@gmail.com> wrote:
>
>> Hi Chesnay,
>>
>> It might be easier to discuss some implementation details in the PR review instead of in the FLIP discussion thread.
>> I have a patch for Kafka connectors ready but haven't submitted the PR yet. Hopefully that will help explain a bit more.
>>
>> ** Re: metric type binding
>> This is a valid point that is worth discussing. If I understand correctly, there are two points:
>>
>> 1. Metric type / interface does not matter as long as the metric semantics are clearly defined.
>> Conceptually speaking, I agree that as long as the metric semantics are defined, the metric type does not matter. To some extent, Gauge / Counter / Meter / Histogram can themselves be thought of as well-recognized semantics, if you wish. In Flink, these metric semantics have their associated interface classes. In practice, such semantic-to-interface binding seems necessary for different components to communicate. Simply standardizing the semantics of the connector metrics does not seem sufficient for people to build an ecosystem on top of. At the end of the day, we still need some embodiment of the metric semantics that people can program against.
>>
>> 2. Sometimes the same metric semantic can be exposed using different metric types / interfaces.
>> This is a good point. Counter and Gauge-as-a-Counter are pretty much interchangeable. This is more of a trade-off between the user experience of metric producers and consumers. The metric producers want to use a Counter or a Gauge depending on whether the count is already tracked in their code, while ideally the metric consumers only want to see a single metric type for each metric. I am leaning towards making the metric producers happy, i.e. allowing either the Gauge or the Counter metric type, and letting the metric consumers handle the type variation. The reason is that in practice there might be more connector implementations than metric reporter implementations. We could also provide some helper method to facilitate reading from such a variable metric type.
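The kind of helper mentioned at the end of point 2 (letting consumers read a count regardless of whether the producer registered a Counter or a Gauge-as-a-Counter) could be sketched as below. Note that `Metric`, `Counter`, and `Gauge` here are simplified stand-ins for Flink's interfaces, and `readCount` is an illustrative name, not an existing API:

```java
// Simplified stand-ins for Flink's metric interfaces (illustrative only).
interface Metric {}

interface Counter extends Metric {
    long getCount();
}

interface Gauge<T> extends Metric {
    T getValue();
}

public class MetricReader {

    /**
     * Reads a count from either a Counter or a numeric Gauge, so that
     * metric consumers (e.g. reporters) only need one code path while
     * producers keep the freedom to register whichever type is convenient.
     */
    public static long readCount(Metric metric) {
        if (metric instanceof Counter) {
            return ((Counter) metric).getCount();
        }
        if (metric instanceof Gauge) {
            Object value = ((Gauge<?>) metric).getValue();
            if (value instanceof Number) {
                return ((Number) value).longValue();
            }
        }
        throw new IllegalArgumentException(
                "Metric is neither a Counter nor a numeric Gauge: " + metric);
    }

    public static void main(String[] args) {
        Counter fromCounter = () -> 42L;   // producer already maintains a count
        Gauge<Long> fromGauge = () -> 42L; // producer exposes an existing field
        System.out.println(readCount(fromCounter)); // 42
        System.out.println(readCount(fromGauge));   // 42
    }
}
```

Both registrations then look identical to the consumer, which is the trade-off Becket describes: slightly more work on the (few) consumer implementations instead of the (many) producers.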
>>
>> Just some quick replies to the comments around implementation details.
>>
>>> 4) single place where metrics are registered except connector-specific ones (which we can't really avoid).
>>
>> Registering connector-specific ones in a single place is actually something that I want to achieve.
>>
>>> 2) I'm talking about time-series databases like Prometheus. We would only have a gauge metric exposing the last fetchTime/emitTime that is regularly reported to the backend (Prometheus), where a user could build a histogram of his choosing when/if he wants it.
>>>
>> Not sure if such downsampling works. As an example, suppose a user complains that there are some intermittent latency spikes (maybe a few records in 10 seconds) in their processing system. A Gauge sampling instantaneous latency seems unlikely to be useful, whereas looking at the actual 99.9th-percentile latency might help.
>>
>> Thanks,
>>
>> Jiangjie (Becket) Qin
>>
>> On Fri, Feb 22, 2019 at 9:30 PM Chesnay Schepler <ches...@apache.org> wrote:
>>
>>> Re: over-complication of implementation.
>>>
>>> I think I understand better now what you're shooting for, effectively something like the OperatorIOMetricGroup.
>>> But still, re-define setupConnectorMetrics() to accept a set of flags for counters/meters (and _possibly_ histograms) along with a set of well-defined Optional<Gauge<?>>, and return the group.
>>>
>>> This solves all issues as far as I can tell:
>>> 1) no metrics must be created manually (except Gauges, which are effectively just Suppliers and you can't get around this),
>>> 2) additional metrics can be registered on the returned group,
>>> 3) see 1),
>>> 4) single place where metrics are registered except connector-specific ones (which we can't really avoid).
>>>
>>> Re: Histogram
>>>
>>> 1) As an example, whether "numRecordsIn" is exposed as a Counter or a Gauge should be irrelevant.
>>> So far we're using the metric type that is the most convenient at exposing a given value. If there is some backing data-structure that we want to expose some data from, we typically opt for a Gauge, as otherwise we're just mucking around with the Meter/Counter API to get it to match. Similarly, if we want to count something but no current count exists, we typically add a Counter. That's why attaching semantics to metric types makes little sense (but unfortunately several reporters already do it); for counters/meters certainly, but the majority of metrics are gauges.
>>>
>>> 2) I'm talking about time-series databases like Prometheus. We would only have a gauge metric exposing the last fetchTime/emitTime that is regularly reported to the backend (Prometheus), where a user could build a histogram of his choosing when/if he wants it.
>>>
>>> On 22.02.2019 13:57, Becket Qin wrote:
>>> > Hi Chesnay,
>>> >
>>> > Thanks for the explanation.
>>> >
>>> > ** Re: FLIP
>>> > I might have misunderstood this, but it seems that "major changes" are well defined in the FLIP process. The full content is as follows:
>>> >
>>> > What is considered a "major change" that needs a FLIP?
>>> >
>>> > Any of the following should be considered a major change:
>>> > - Any major new feature, subsystem, or piece of functionality
>>> > - *Any change that impacts the public interfaces of the project*
>>> >
>>> > What are the "public interfaces" of the project?
>>> >
>>> > *All of the following are public interfaces* that people build around:
>>> > - DataStream and DataSet API, including classes related to those, such as StreamExecutionEnvironment
>>> > - Classes marked with the @Public annotation
>>> > - On-disk binary formats, such as checkpoints/savepoints
>>> > - User-facing scripts/command-line tools, i.e.
>>> > bin/flink, Yarn scripts, Mesos scripts
>>> > - Configuration settings
>>> > - *Exposed monitoring information*
>>> >
>>> > So any change to monitoring information is considered a public interface change, and any public interface change is considered a "major change".
>>> >
>>> > ** Re: over-complication of implementation.
>>> >
>>> > Although this is more of an implementation detail that is not covered by the FLIP, it may be worth discussing.
>>> >
>>> > First of all, I completely agree that we should use the simplest way to achieve our goal. To me the goal is the following:
>>> > 1. Clear connector conventions and interfaces.
>>> > 2. The ease of creating a connector.
>>> >
>>> > Both of them are important to the prosperity of the connector ecosystem. So I'd rather abstract as much as possible on our side to make the connector developer's work lighter. Given this goal, a static util method approach might have a few drawbacks:
>>> > 1. Users still have to construct the metrics by themselves. (And note that this might be error-prone by itself. For example, a custom wrapper around a Dropwizard meter may be used instead of MeterView.)
>>> > 2. When connector-specific metrics are added, it is difficult to enforce the scope to be the same as that of the standard metrics.
>>> > 3. It seems that method proliferation is inevitable if we want to apply sanity checks, e.g. that the numBytesIn metric was not registered as a meter.
>>> > 4. Metrics are still defined in random places and hard to track.
>>> >
>>> > The current PR I had was inspired by the Config system in Kafka, which I found pretty handy. In fact it is used not only by Kafka itself but even by some other projects that depend on Kafka. I am not saying this approach is perfect.
>>> > But I think it is worth saving the work for connector writers and encouraging a more systematic implementation. That being said, I am fully open to suggestions.
>>> >
>>> > Re: Histogram
>>> > I think there are two orthogonal questions around those metrics:
>>> >
>>> > 1. Regardless of the metric type, by just looking at the meaning of a metric, is it generic to all connectors? If the answer is yes, we should include the metric in the convention. Whether we include it in the convention or not, some connector implementations will emit such a metric. It is better to have a convention than to let each connector do random things.
>>> >
>>> > 2. If a standard metric is a histogram, what should we do?
>>> > I agree that we should make it clear that using histograms carries a performance risk. But I do see that histograms are useful in some fine-granularity debugging where one does not have the luxury of stopping the system and injecting more inspection code. So the workaround I am thinking of is to provide some implementation suggestions. Assume later on we have a mechanism for selective metrics. In the abstract metrics class we can then disable those metrics by default, so individual connector writers do not have to do anything (this is another advantage of having an AbstractMetrics instead of static util methods).
>>> >
>>> > I am not sure I fully understand the histogram-in-the-backend approach. Can you explain a bit more? Do you mean emitting the raw data, e.g. fetchTime and emitTime, with each record and letting the histogram computation happen in the background? Or having the processing thread put the values into a queue and a separate thread poll from the queue and add them to the histogram?
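For reference, the second variant asked about above (hot processing thread only enqueues raw values; a background thread drains them into the expensive histogram) could look roughly like the following sketch. The `OffloadedHistogram` class and its trivial count/mean aggregation are illustrative stand-ins, not Flink's Histogram interface:

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

/**
 * Sketch of offloading histogram maintenance from the processing thread:
 * record() only enqueues the raw sample (cheap), while a background thread
 * periodically drains the queue and does the actual aggregation.
 */
public class OffloadedHistogram {

    private final BlockingQueue<Long> samples = new ArrayBlockingQueue<>(10_000);

    // Only touched by the draining thread, so no locking needed here.
    private long count;
    private long sum;

    /** Called from the hot processing thread; drops the sample if the queue is full. */
    public void record(long value) {
        samples.offer(value);
    }

    /** Called periodically from a background thread. */
    public void drainOnce() {
        Long v;
        while ((v = samples.poll()) != null) {
            count++;
            sum += v;
        }
    }

    public long count() {
        return count;
    }

    public double mean() {
        return count == 0 ? 0.0 : (double) sum / count;
    }

    public static void main(String[] args) {
        OffloadedHistogram h = new OffloadedHistogram();
        for (long i = 1; i <= 4; i++) {
            h.record(i); // hot path: just an offer()
        }
        h.drainOnce();   // background path: actual aggregation
        System.out.println(h.count()); // 4
        System.out.println(h.mean());  // 2.5
    }
}
```

A real implementation would keep quantile estimates rather than a mean, but the threading shape (bounded queue between producer and aggregator) is the point of the sketch.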
>>> >
>>> > Thanks,
>>> >
>>> > Jiangjie (Becket) Qin
>>> >
>>> > On Fri, Feb 22, 2019 at 4:34 PM Chesnay Schepler <ches...@apache.org> wrote:
>>> >
>>> >> Re: FLIP
>>> >> The very first line under both the main header and the Purpose section describes FLIPs as "major changes", which this isn't.
>>> >>
>>> >> Re: complication
>>> >> I'm not arguing against standardization, but again against an over-complicated implementation when a static utility method would be sufficient:
>>> >>
>>> >> public static void setupConnectorMetrics(
>>> >>     MetricGroup operatorMetricGroup,
>>> >>     String connectorName,
>>> >>     Optional<Gauge<Long>> numRecordsIn,
>>> >>     ...)
>>> >>
>>> >> This gives you all you need:
>>> >> * a well-defined set of metrics for a connector to opt in to
>>> >> * standardized naming schemes for the scope and individual metrics
>>> >> * standardized metric types (although personally I'm not interested in that, since metric types should be considered syntactic sugar)
>>> >>
>>> >> Re: Configurable Histogram
>>> >> If anything they _must_ be turned off by default, but the metric system already exposes so many options that I'm not too keen on adding even more. You have also only addressed my first argument against histograms (performance); the second one still stands (calculate histograms in metric backends instead).
>>> >>
>>> >> On 21.02.2019 16:27, Becket Qin wrote:
>>> >>> Hi Chesnay,
>>> >>>
>>> >>> Thanks for the comments. I think this is worthy of a FLIP because it is public API. According to the FLIP description, a FLIP is required in case of:
>>> >>> - Any change that impacts the public interfaces of the project
>>> >>>
>>> >>> and the following entry is found in the definition of "public interface":
>>> >>> - Exposed monitoring information
>>> >>>
>>> >>> Metrics are critical to any production system.
>>> >>> So a clear metric definition is important for any serious user. For an organization with a large Flink installation, a change in metrics means a great amount of work. So such changes do need to be fully discussed and documented.
>>> >>>
>>> >>> ** Re: Histogram.
>>> >>> We can discuss whether there is a better way to expose metrics that are suitable for histograms. My micro-benchmark on various histogram implementations also indicates that they are significantly slower than other metric types. But I don't think that means we should never use histograms; it means we should use them with caution. For example, we can suggest that implementations turn them off by default and only turn them on for a short amount of time when performing some micro-debugging.
>>> >>>
>>> >>> ** Re: complication:
>>> >>> Connector conventions are essential for the Flink ecosystem. The Flink connector pool is probably the most important part of Flink, just like in any other data system. Clear connector conventions will help build the Flink ecosystem in a more organic way.
>>> >>> Take the metrics convention as an example: imagine someone has developed a Flink connector for system Foo, and another developer has developed a monitoring and diagnostic framework for Flink which analyzes Flink job performance based on metrics. With a clear metric convention, those two projects could be developed independently. Once users put them together, it would work without additional modifications. This cannot be easily achieved by just defining a few constants.
>>> >>>
>>> >>> ** Re: selective metrics:
>>> >>> Sure, we can discuss that in a separate thread.
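The static-utility alternative being debated here, the setupConnectorMetrics() sketch Chesnay gives earlier in the thread, could be fleshed out roughly as follows. `MetricGroup` and `Gauge` are simplified stand-ins for Flink's interfaces, and the two `Optional` parameters stand in for the full standard metric set:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

/**
 * Sketch of the static-utility approach: one entry point that fixes the
 * scope (a sub-group named after the connector) and the standard metric
 * names, registering only the metrics the connector opted in to.
 */
public class ConnectorMetrics {

    // Simplified stand-ins for Flink's metric interfaces (illustrative only).
    interface Gauge<T> {
        T getValue();
    }

    static class MetricGroup {
        final Map<String, Object> metrics = new HashMap<>();
        final Map<String, MetricGroup> subGroups = new HashMap<>();

        MetricGroup addGroup(String name) {
            return subGroups.computeIfAbsent(name, k -> new MetricGroup());
        }

        <T> Gauge<T> gauge(String name, Gauge<T> gauge) {
            metrics.put(name, gauge);
            return gauge;
        }
    }

    /** Registers the standardized connector metrics and returns the group. */
    public static MetricGroup setupConnectorMetrics(
            MetricGroup operatorMetricGroup,
            String connectorName,
            Optional<Gauge<Long>> numRecordsIn,
            Optional<Gauge<Long>> numBytesIn) {
        // Standardized scope: all connector metrics live under the connector name.
        MetricGroup group = operatorMetricGroup.addGroup(connectorName);
        // Standardized names: only opted-in metrics get registered.
        numRecordsIn.ifPresent(g -> group.gauge("numRecordsIn", g));
        numBytesIn.ifPresent(g -> group.gauge("numBytesIn", g));
        return group;
    }

    public static void main(String[] args) {
        MetricGroup root = new MetricGroup();
        MetricGroup kafka = setupConnectorMetrics(
                root,
                "KafkaConsumer",
                Optional.of((Gauge<Long>) () -> 100L),
                Optional.empty());
        System.out.println(kafka.metrics.keySet()); // [numRecordsIn]
    }
}
```

Returning the group (point 2 of Chesnay's list) lets a connector register its connector-specific metrics on the same scope afterwards, which is exactly the part Becket notes cannot be enforced by the utility itself.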
>>> >>>
>>> >>> @Dawid
>>> >>>
>>> >>> ** Re: latency / fetchedLatency
>>> >>> The primary purpose of establishing such a convention is to help developers write connectors in a more compatible way. The convention is supposed to be defined proactively. So when looking at the convention, it seems more important to ask whether the concept is applicable to connectors in general. It might be true that so far only the Kafka connector reports latency. But there might be hundreds of other connector implementations in the Flink ecosystem, though not in the Flink repo, and some of them also emit latency. I think a lot of other sources actually also have an append timestamp, e.g. database binlogs and some K-V stores. So I wouldn't be surprised if some database connector can also emit latency metrics.
>>> >>>
>>> >>> Thanks,
>>> >>>
>>> >>> Jiangjie (Becket) Qin
>>> >>>
>>> >>> On Thu, Feb 21, 2019 at 10:14 PM Chesnay Schepler <ches...@apache.org> wrote:
>>> >>>
>>> >>>> Regarding 2) It doesn't make sense to investigate this as part of this FLIP. This is something that could be of interest for the entire metric system, and should be designed for as such.
>>> >>>>
>>> >>>> Regarding the proposal as a whole:
>>> >>>>
>>> >>>> Histogram metrics shall not be added to the core of Flink. They are significantly more expensive than other metrics, and calculating histograms in the application is regarded as an anti-pattern by several metric backends, who instead recommend exposing the raw data and calculating the histogram in the backend.
>>> >>>>
>>> >>>> Second, this seems overly complicated. Given that we already established that not all connectors will export all metrics, we are effectively reducing this down to a consistent naming scheme.
>>> >>>> We don't need anything sophisticated for that; basically just a few constants that all connectors use.
>>> >>>>
>>> >>>> I'm not convinced that this is worthy of a FLIP.
>>> >>>>
>>> >>>> On 21.02.2019 14:26, Dawid Wysakowicz wrote:
>>> >>>>> Hi,
>>> >>>>>
>>> >>>>> Ad 1. In general I understand and I agree. But those particular metrics (latency, fetchLatency) would right now only be reported if the user uses the KafkaConsumer with the internal timestampAssigner and StreamCharacteristic set to EventTime, right? That sounds like a very specific case. I am not sure if we should introduce a generic metric that will be disabled/absent for most implementations.
>>> >>>>>
>>> >>>>> Ad 2. That sounds like an orthogonal issue that might make sense to investigate in the future.
>>> >>>>>
>>> >>>>> Best,
>>> >>>>>
>>> >>>>> Dawid
>>> >>>>>
>>> >>>>> On 21/02/2019 13:20, Becket Qin wrote:
>>> >>>>>> Hi Dawid,
>>> >>>>>>
>>> >>>>>> Thanks for the feedback. That makes sense to me. There are two cases to be addressed.
>>> >>>>>>
>>> >>>>>> 1. The metrics are supposed to be a guidance. It is likely that a connector only supports some but not all of the metrics. In that case, each connector implementation should have the freedom to decide which metrics are reported. For the metrics that are supported, the guidance should be followed.
>>> >>>>>>
>>> >>>>>> 2. Sometimes users may want to disable certain metrics for some reason (e.g. performance / reprocessing of data). A generic mechanism should be provided to allow users to choose which metrics are reported. This mechanism should also be honored by the connector implementations.
>>> >>>>>>
>>> >>>>>> Does this sound reasonable to you?
>>> >>>>>>
>>> >>>>>> Thanks,
>>> >>>>>>
>>> >>>>>> Jiangjie (Becket) Qin
>>> >>>>>>
>>> >>>>>> On Thu, Feb 21, 2019 at 4:22 PM Dawid Wysakowicz <dwysakow...@apache.org> wrote:
>>> >>>>>>
>>> >>>>>>> Hi,
>>> >>>>>>>
>>> >>>>>>> Generally I like the idea of having a unified, standard set of metrics for all connectors. I have some slight concerns about fetchLatency and latency, though. They are computed based on EventTime, which is not a purely technical feature. It often depends on some business logic, and might be absent or defined after the source. Those metrics could also behave in a weird way in case of replaying a backlog. Therefore I am not sure if we should include those metrics by default. Maybe we could at least introduce a feature switch for them? What do you think?
>>> >>>>>>>
>>> >>>>>>> Best,
>>> >>>>>>>
>>> >>>>>>> Dawid
>>> >>>>>>>
>>> >>>>>>> On 21/02/2019 03:13, Becket Qin wrote:
>>> >>>>>>>
>>> >>>>>>> Bump. If there are no objections to the proposed metrics, I'll start a voting thread later today.
>>> >>>>>>>
>>> >>>>>>> Thanks,
>>> >>>>>>>
>>> >>>>>>> Jiangjie (Becket) Qin
>>> >>>>>>>
>>> >>>>>>> On Mon, Feb 11, 2019 at 8:17 PM Becket Qin <becket....@gmail.com> wrote:
>>> >>>>>>>
>>> >>>>>>> Hi folks,
>>> >>>>>>>
>>> >>>>>>> I would like to start the FLIP discussion thread about standardizing the connector metrics.
>>> >>>>>>>
>>> >>>>>>> In short, we would like to provide a convention for Flink connector metrics. It will help simplify the monitoring and alerting on Flink jobs.
>>> >>>>>>> The FLIP link is:
>>> >>>>>>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-33%3A+Standardize+Connector+Metrics
>>> >>>>>>>
>>> >>>>>>> Thanks,
>>> >>>>>>>
>>> >>>>>>> Jiangjie (Becket) Qin
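The "generic mechanism to allow users to choose which metrics are reported" and the "feature switch" for latency metrics raised upthread might take a shape like the sketch below. All names (`MetricSwitch`, `registerIfEnabled`, the config source) are hypothetical; Flink had no such switch at the time of this discussion:

```java
import java.util.HashSet;
import java.util.Set;
import java.util.function.Supplier;

/**
 * Sketch of a selective-metrics switch: connectors register their (standard
 * or specific) metrics through this guard, and users can disable individual
 * metrics by name, e.g. expensive histograms or the EventTime-dependent
 * latency/fetchLatency metrics Dawid is concerned about.
 */
public class MetricSwitch {

    private final Set<String> disabledMetrics = new HashSet<>();

    /** In practice this would be populated from user configuration. */
    public MetricSwitch disable(String metricName) {
        disabledMetrics.add(metricName);
        return this;
    }

    public boolean isEnabled(String metricName) {
        return !disabledMetrics.contains(metricName);
    }

    /**
     * Only creates (and would register) the metric when it is enabled,
     * so a disabled metric costs nothing at runtime.
     */
    public <M> M registerIfEnabled(String metricName, Supplier<M> metricFactory) {
        return isEnabled(metricName) ? metricFactory.get() : null;
    }

    public static void main(String[] args) {
        MetricSwitch sw = new MetricSwitch().disable("fetchLatency");
        Object fetchLatency = sw.registerIfEnabled("fetchLatency", Object::new);
        Object numRecordsIn = sw.registerIfEnabled("numRecordsIn", Object::new);
        System.out.println(fetchLatency == null);  // true: switched off
        System.out.println(numRecordsIn != null);  // true: registered
    }
}
```

Placing such a switch inside a shared abstraction (like the AbstractMetrics discussed above) is what would let histogram-style metrics ship disabled by default without each connector writer doing anything.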