So this probably doesn't belong in this thread, but here goes:

When you think of the metric system as a source and reporters as sinks, one has to consider what the source emits:

Either:
a) events for added/removed metrics
b) periodic emissions of the values of all metrics, with the plethora of additional scope information the reporters might require

Approach a) obviously doesn't work in a distributed setting, but it is closest to the current approach: reporters access metrics / meta-info as needed, and we only use as many resources as we actually need.

Approach b) does work in a distributed setting, and conceptually works reasonably well with scheduled reporters (i.e., reporters that periodically write data to the external system). But for reporters that are polled on demand by some external system (Prometheus, JMX), this can waste resources if the polling interval is larger than the update interval, or if metrics aren't being polled at all. Additionally, you'd have to include _a lot_ of metadata with each metric to retain the current functionality.
Naturally this approach also consumes additional network resources.
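
To make the distinction concrete, here is roughly how the two approaches map onto the reporter interfaces we have today (mirroring org.apache.flink.metrics.reporter from memory, so treat the signatures as illustrative):

    // Approach a): the runtime notifies the reporter of added/removed metrics;
    // a polled reporter (Prometheus, JMX) then reads values only when asked.
    public interface MetricReporter {
        void open(MetricConfig config);
        void close();
        void notifyOfAddedMetric(Metric metric, String metricName, MetricGroup group);
        void notifyOfRemovedMetric(Metric metric, String metricName, MetricGroup group);
    }

    // Approach b): the reporter additionally implements Scheduled; report() is
    // called at a fixed interval and pushes the current value of every metric.
    public interface Scheduled {
        void report();
    }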

Another concern I have is that we're mixing two things here: one is allowing users to process metrics in a DataStream fashion (which has quite a few caveats, like not being able to access metrics from the dispatcher / RM, since we surely aren't running user code in them), the other is handling formats.

And I have to point out that the format problem only really applies to Kafka. For other reporters we don't have this problem, since the backends define the format, and there are easier ways to handle it without a major rework (wrap the Kafka connector, have a factory for serialization schemes, _done_).
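
To sketch what I mean (KafkaReporter and MetricSerializationSchema are made-up names here, and I'm glossing over configuration):

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.apache.kafka.common.serialization.ByteArraySerializer;

    // Hypothetical pluggable format; a small factory could pick the
    // implementation (JSON, Avro, ...) based on the reporter config.
    interface MetricSerializationSchema {
        byte[] serialize(String metricName, Object value);
    }

    // Sketch: the reporter just wraps a vanilla Kafka producer.
    class KafkaReporter {
        private final KafkaProducer<byte[], byte[]> producer;
        private final MetricSerializationSchema format;
        private final String topic;

        KafkaReporter(Properties kafkaProps, MetricSerializationSchema format, String topic) {
            this.producer = new KafkaProducer<>(
                    kafkaProps, new ByteArraySerializer(), new ByteArraySerializer());
            this.format = format;
            this.topic = topic;
        }

        void report(String metricName, Object value) {
            producer.send(new ProducerRecord<>(topic, format.serialize(metricName, value)));
        }

        void close() {
            producer.close();
        }
    }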

I'd now suggest moving the process-as-a-datastream idea into a different thread, because it a) isn't _really_ connected to the reporter itself and b) we already have enough points of contention.


As for having the connector in flink vs flink-packages, I'm constantly amazed at how much value people attribute to the source of a reporter being in Flink. After all, there's _nothing_ stopping us from including additional 3rd-party reporters in the distribution. There's also _nothing_ stopping us from linking to 3rd-party reporters in the documentation.

In other words, all user-facing parts are agnostic to whether the reporter is maintained by Flink or not, hence I'm not accepting the argument anymore that something must be in Flink for it to be used. It just doesn't make sense.

Having every widely used component within Flink is just not maintainable in the long run, as we all know, hence I'm very much in favor of having it maintained externally via flink-packages. That's the very purpose of that site.

And, on a final note, there's also _nothing_ stopping us from adding something to Flink after X <time_unit> if it becomes integral to the way Flink is being used.

On 20/11/2019 15:46, Gyula Fóra wrote:
@Becket , Yun:
Regarding the core/ecosystem project:

I don't completely agree with your arguments regarding why this should be
an external ecosystem project instead of part of the Flink repo.
A metric connector is relevant for Flink users, not for the metric store.
Metric storage systems don't care about where the logs are coming from, but
Flink job authors need a way to get the metrics to whatever systems they
have. The same applies to other connectors. If we don't provide canonical
ways of communicating with external systems, be it sources, sinks or
metrics, that makes everyone's life a bit harder.

Historically most of the connectors went straight to Flink and over time
the maintenance of these has become quite a challenge with Flink core
itself growing rapidly. I agree that we have to make these decisions and not
include every new external connector in the Flink core. I think this
decision should be based on the value it brings to users, and how often it
will be used. These are not easy questions and the Flink ecosystem website
is a great way for gauging the popularity/value of a specific connector.

Another way of deciding this would be to talk to the Flink community (like
we do with this thread) and see if this is a common pattern and if we can
come up with a good generic solution regarding Kafka versioning and formats
that will work for most. If we see big interest here and have a consensus
on the formats I don't see any reason why we shouldn't include it.

Regarding the message format:
The idea with the JSON format was that it could be an easy-to-use source
for downstream metric systems to integrate with. I don't have much
experience with different metric storage systems so maybe Yun you are right
that you will always end up having another processor for this. But even in
that case JSON is a pretty safe format as it is easy to process no matter
what you use.
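
To make that concrete, the kind of message I had in mind could be built with e.g. Jackson; the field set below is only an example, not a proposal:

    import com.fasterxml.jackson.databind.ObjectMapper;
    import com.fasterxml.jackson.databind.node.ObjectNode;

    public class MetricJsonExample {
        public static void main(String[] args) throws Exception {
            ObjectMapper mapper = new ObjectMapper();
            // Example payload; the actual field set is up for discussion.
            ObjectNode message = mapper.createObjectNode();
            message.put("name", "numRecordsIn");
            message.put("scope", "host.taskmanager.job.task.operator");
            message.put("type", "counter");
            message.put("value", 42L);
            message.put("timestamp", System.currentTimeMillis());
            System.out.println(mapper.writeValueAsString(message));
        }
    }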
Otherwise I agree that a pluggable format would be much better and more
generic. We just need to find a way to keep it simple :D

@Bowen, Chesnay

The whole idea of making the metrics reporter a source sounds pretty great
at first :) If we could do this, it would definitely make things more
flexible, but even then you'd probably need some sort of serialization
schema implementation for the metrics by default, which is basically what
the Kafka reporter would do + a minimal client.

Chesnay, I don't completely understand what you mean by:
"Periodically emitting the values of all metrics goes against the
convention which we established"
Isn't this exactly what the Kafka metrics reporter would do anyways?

If you could elaborate on this a bit more, that would be very helpful for
me, because I don't have a good overview of the key design principles of
the current metric system.

Cheers,
Gyula

On Wed, Nov 20, 2019 at 10:22 AM Chesnay Schepler <ches...@apache.org>
wrote:

@Bowen I can see where you're coming from, but I don't think this would
work too well. Your "stream" would have to contain events for
added/removed metrics, but metrics are inherently not Serializable. I
think this would end up being a weird special case.

(Periodically emitting the values of all metrics goes against the
convention, established from the very beginning, that metrics should only
incur costs when necessary; as such, a reporter that is polled on demand
should only consume resources if it is actually called.)

Additionally, there are plans to add additional methods to the reporter
in the future, at which point the source interface would no longer
suffice. At that point you'd need a separate interface again, and
wrappers for your sinks.

This would result in what is the trivial solution for this reporter right
now anyway: have the reporter use a Kafka connector internally, with all
the features that it offers.

Overall I think we'd be unnecessarily coupling reporters to the source
interface, and I don't see a real benefit.

On 19/11/2019 19:47, Bowen Li wrote:
Hi,

What's still unclear to me so far is: what would be the fundamental
differences between this Kafka reporter and Flink's existing Kafka
producer? I don't see any yet.

I've been thinking about Flink metrics for a while, and the "metric
reporter" feels a bit redundant to me. As you may already know, Flink has
been used to process external metrics in various companies. If you think
about it, Flink's own metric system is no different from the external ones;
it's actually just another stream source, and metric reporters are just
data sinks writing to external storage, with no guarantees or checkpointing.

So instead of adding Kafka or other MQ reporters and worrying about message
formats (which are already solved by Flink's sinks), we can generalize and
expose Flink's metric system as a simple built-in stream source, and
"metric reporters" become just customized sinks tailored to this source.
Users may even be able to access and process it in a stream environment
with the DataStream API. That gives users full flexibility to manipulate
Flink metrics with Flink, and it's more of an "eat your own dogfood"
philosophy.

This seems too good to be true, and I haven't had time to think of the
details. Let me know if I miss anything here.
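
As a purely hypothetical sketch of the idea (MetricEvent and MetricsSource
don't exist; the stub below just stands in for the built-in source):

    import org.apache.flink.streaming.api.datastream.DataStream;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
    import org.apache.flink.streaming.api.functions.source.SourceFunction;

    public class MetricsAsAStream {

        // Hypothetical event type, just enough for the example.
        public static class MetricEvent {
            public String name;
            public double value;
            public MetricEvent() {}
            public MetricEvent(String name, double value) { this.name = name; this.value = value; }
            public String getName() { return name; }
        }

        // Stub standing in for a built-in "metric system" source.
        public static class MetricsSource implements SourceFunction<MetricEvent> {
            private volatile boolean running = true;

            @Override
            public void run(SourceContext<MetricEvent> ctx) throws Exception {
                while (running) {
                    ctx.collect(new MetricEvent("numRecordsIn", 42.0));
                    Thread.sleep(1000);
                }
            }

            @Override
            public void cancel() { running = false; }
        }

        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

            DataStream<MetricEvent> metrics = env.addSource(new MetricsSource());

            // From here on it's just the regular DataStream API; any existing
            // sink (e.g. a Kafka producer) could replace print().
            metrics.filter(m -> m.getName().startsWith("numRecords")).print();

            env.execute("metrics-as-a-stream");
        }
    }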


On Mon, Nov 18, 2019 at 09:51 Yun Tang <myas...@live.com> wrote:

Hi all

Glad to see this topic in the community.
We at Alibaba also implemented a Kafka metrics reporter and extended it to
other message queues like Alibaba Cloud Log Service [1] half a year ago.
The reason we did not launch a similar discussion is that we felt we would
only be providing a way to report metrics to Kafka. Unlike the currently
supported metrics reporters, e.g. InfluxDB and Graphite, which all have an
easy-to-use data source in Grafana to visualize metrics, with a Kafka
metrics reporter we would still need another way to consume the data and
serve as a data source for an observability platform, and this would
differ from company to company.

I think this is the main concern with including this in a popular
open-source main repo, and I pretty much agree with Becket's suggestion to
contribute this as a flink-package, where we could offer an end-to-end
solution including how to visualize the metrics data.

[1] https://www.alibabacloud.com/help/doc-detail/29003.htm

Best
Yun Tang

On 11/18/19, 8:19 AM, "Becket Qin" <becket....@gmail.com> wrote:

      Hi Gyula,

      Thanks for bringing this up. It is a useful addition to have a Kafka
      metrics reporter. I understand that we already have Prometheus and
      DataDog reporters in the Flink main repo. However, personally speaking,
      I would slightly prefer to have the Kafka metrics reporter as an
      ecosystem project instead of in the main repo, for the following reasons:

      1. To keep core Flink more focused. So in general, if a component is
      more relevant to an external system than to Flink itself, it might be
      good to keep it as an ecosystem project. And a metrics reporter seems
      a good example of that.
      2. This helps encourage more contributions to the Flink ecosystem,
      instead of giving the impression that anything in the Flink ecosystem
      must be in the Flink main repo.
      3. To facilitate our ecosystem project authors, we have launched a
      website [1] to help the community keep track of and advertise the
      ecosystem projects. It looks like a good place to put the Kafka
      metrics reporter.

      Regarding the message format: while I think using JSON by default is
      fine, as it does not introduce much external dependency, I wonder if
      we should make the message format pluggable. Many companies probably
      already have their own serde format for all their Kafka messages. For
      example, maybe they would like to just use an Avro record for their
      metrics instead of introducing a new JSON format. Also, in many cases
      there could be a lot of metric messages sent by the Flink jobs. The
      JSON format is less efficient and might have too much overhead in
      that case.
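
      To illustrate what "pluggable" could look like in practice, the format
      could be selected via the usual reporter configuration in
      flink-conf.yaml (the reporter class and the "format" key below are
      hypothetical):

      metrics.reporter.kafka.class: org.apache.flink.metrics.kafka.KafkaReporter
      metrics.reporter.kafka.topic: flink-metrics
      metrics.reporter.kafka.format: json   # or e.g. "avro"; hypothetical key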

      Thanks,

      Jiangjie (Becket) Qin

      [1] https://flink-packages.org/


      On Mon, Nov 18, 2019 at 3:30 AM Konstantin Knauf <konstan...@ververica.com>
      wrote:

      > Hi Gyula,
      >
      > thank you for proposing this. +1 for adding a KafkaMetricsReporter.
      > In terms of the dependency we could go a similar route as for the
      > "universal" Flink Kafka Connector which, to my knowledge, always
      > tracks the latest Kafka version as of the Flink release and relies on
      > compatibility of the underlying KafkaClient. JSON sounds good to me.
      >
      > Cheers,
      >
      > Konstantin
      >
      >
      >
      >
      >
      On Sun, Nov 17, 2019 at 1:46 PM Gyula Fóra <gyf...@apache.org> wrote:
      >
      > > Hi all!
      > >
      > > Several users have asked in the past about a Kafka based metrics
      > > reporter which can serve as a natural connector between arbitrary
      > > metric storage systems and a straightforward way to process Flink
      > > metrics downstream.
      > >
      > > I think this would be an extremely useful addition but I would
      > > like to hear what others in the dev community think about it
      > > before submitting a proper proposal.
      > >
      > > There are at least 3 questions to discuss here:
      > >
      > >
      > > *1. Do we want the Kafka metrics reporter in the Flink repo?*
      > >     As it is much more generic than the other metrics reporters
      > > already included, I would say yes. Also, as almost everyone uses
      > > Flink with Kafka, it would be a natural reporter choice for a lot
      > > of users.
      > > *2. How should we handle the Kafka dependency of the connector?*
      > >     I think it would be overkill to add different Kafka versions
      > > here, so I would use Kafka 2.+, which has the best compatibility
      > > and is future proof.
      > > *3. What message format should we use?*
      > >     I would go with JSON for readability and compatibility.
      > >
      > > There is a relevant JIRA open for this already.
      > > https://issues.apache.org/jira/browse/FLINK-14531
      > >
      > > We at Cloudera also promote this as a scalable way of pushing
      > > metrics to other systems, so we are very happy to contribute an
      > > implementation or cooperate with others on building it.
      > >
      > > Please let me know what you think!
      > >
      > > Cheers,
      > > Gyula
      > >
      >
      >
      > --
      >
      > Konstantin Knauf | Solutions Architect
      >
      > +49 160 91394525
      >
      >
      > Follow us @VervericaData Ververica <https://www.ververica.com/>
      >
      >
      > --
      >
      > Join Flink Forward <https://flink-forward.org/> - The Apache Flink
      > Conference
      >
      > Stream Processing | Event Driven | Real Time
      >
      > --
      >
      > Ververica GmbH | Invalidenstrasse 115, 10115 Berlin, Germany
      >
      > --
      > Ververica GmbH
      > Registered at Amtsgericht Charlottenburg: HRB 158244 B
      > Managing Directors: Timothy Alexander Steinert, Yip Park Tung Jason,
      > Ji (Tony) Cheng
      >




