Re: [DISCUSS] PIP-264: Enhanced OTel-based metric system

Devin Bost Wed, 14 Jun 2023 18:10:45 -0700

> Thanks for the details, Devin. Curious - 'We' stands for which company?


What do you mean? I was quoting Rajan when I said, "we."


Devin G. Bost


On Wed, Jun 14, 2023 at 10:02 AM Asaf Mesika <asaf.mes...@gmail.com> wrote:

> Thanks for the details, Devin. Curious - 'We' stands for which company?
>
> Can you take a look at my previous response to see if it answers the
> concern you raised?
>
> Thanks!
>
>
> On Wed, Jun 14, 2023 at 1:49 PM Devin Bost <devin.b...@gmail.com> wrote:
>
> > > Hi,
> > >
> > > Are we proposing a change to break existing metrics compatibility
> > > (prometheus)? If that is the case then it's a big red flag as it will
> be
> > a
> > > pain for any company to upgrade Pulsar as monitoring is THE most
> > important
> > > part of the system and we don't even want to break compatibility for
> any
> > > small things to avoid interruption for users that are using Pulsar
> > system.
> > > I think it's always good to enhance a system by maintaining
> compatibility
> > > and I would be fine if we can introduce new metrics API without causing
> > ANY
> > > interruption to existing metrics API. But if we can't maintain
> > > compatibility then it's a big red flag and not acceptable for the
> Pulsar
> > > community.
> >
> > Proposing a large breaking change (even if it's crucial) is the single
> > fastest way to motivate your users to migrate to a different platform. I
> > wish it wasn't the case, but it's the cold reality.
> >
> > With that said, I'm a big proponent of Open Telemetry. I did a big video
> a
> > while back that some of you may remember on the use of Open Tracing
> (before
> > it was merged into Open Telemetry). Open Telemetry has gained
> considerable
> > momentum in the industry since then.
> >
> > I'm also very interested in a solution to the metrics problem. I've run
> > into the scalability issues with metrics in production, and I've been
> very
> > concerned about the metrics bottlenecks around our ability to deliver our
> > promises around supporting large numbers of topics. One of the big
> > advantages of Pulsar over Kafka is supposed to be that topics are cheap,
> > but as it stands, our current metrics design gets seriously in the way of
> > that. Generally speaking, I'm open to solutions, especially if they align
> > us with a growing industry standard.
> >
> > - Devin
> >
> >
> > On Wed, Jun 14, 2023, 3:28 AM Enrico Olivelli <eolive...@gmail.com>
> wrote:
> >
> > > On Wed, Jun 14, 2023, 04:33 Rajan Dhabalia <rdhaba...@apache.org>
> > > wrote:
> > >
> > > > Hi,
> > > >
> > > > Are we proposing a change to break existing metrics compatibility
> > > > (prometheus)? If that is the case then it's a big red flag as it will
> > be
> > > a
> > > > pain for any company to upgrade Pulsar as monitoring is THE most
> > > important
> > > > part of the system and we don't even want to break compatibility for
> > any
> > > > small things to avoid interruption for users that are using Pulsar
> > > system.
> > > > I think it's always good to enhance a system by maintaining
> > compatibility
> > > > and I would be fine if we can introduce new metrics API without
> causing
> > > ANY
> > > > interruption to existing metrics API. But if we can't maintain
> > > > compatibility then it's a big red flag and not acceptable for the
> > Pulsar
> > > > community.
> > > >
> > >
> > > I agree.
> > >
> > > If it is possible to export data in a way that is compatible with
> > Prometheus
> > > without adding too much overhead then I would support this work.
> > >
> > > About renaming the metrics: we can do it only if the changes for users
> > are
> > > as trivial as replacing the queries in the grafana dashboard or in
> > alerting
> > > systems.
> > >
> > > Asaf, do you have a prototype? Built over any version of Pulsar?
> > >
> > > Also, it would be very useful to start an initiative to collect the
> list
> > of
> > > metrics that people really use in production, especially for automated
> > > alerts.
> > >
> > > In my experience you usually care about:
> > > - in/out traffic (rates, bytes...)
> > > - number of producer, consumers, topics, subscriptions...
> > > - backlog
> > > - jvm metrics
> > > - function custom metrics
> > >
> > >
> > > Enrico
> > >
> > >
> > >
> > >
> > > > Thanks,
> > > > Rajan
> > > >
> > > > On Sun, May 21, 2023 at 9:01 AM Asaf Mesika <asaf.mes...@gmail.com>
> > > wrote:
> > > >
> > > > > Thanks for the reply, Enrico.
> > > > > Completely agree.
> > > > > This made me realize my TL;DR wasn't talking about export.
> > > > > I added this to it:
> > > > >
> > > > > ---
> > > > > Pulsar OTel Metrics will support exporting as Prometheus HTTP
> > endpoint
> > > > > (`/metrics` but different port) for backward compatibility and also
> > > OTLP,
> > > > > so you can push the metrics to OTel Collector and from there ship
> it
> > to
> > > > any
> > > > > destination.
> > > > > ---
> > > > >
> > > > > OTel supports two kinds of exporter: Prometheus (HTTP) and OTLP
> > (push).
> > > > > We'll just configure to use them.
> > > > >
> > > > >
> > > > >
> > > > > On Mon, May 15, 2023 at 10:35 AM Enrico Olivelli <
> > eolive...@gmail.com>
> > > > > wrote:
> > > > >
> > > > > > Asaf,
> > > > > > thanks for contributing in this area.
> > > > > > Metrics are a fundamental feature of Pulsar.
> > > > > >
> > > > > > Currently I find it very awkward to maintain metrics, and also I
> > see
> > > > > > it as a problem to support only Prometheus.
> > > > > >
> > > > > > Regarding your proposal, IIRC in the past someone else proposed
> to
> > > > > > support other metrics systems and they have been suggested to
> use a
> > > > > > sidecar approach,
> > > > > > that is to add something next to Pulsar services that served the
> > > > > > metrics in the preferred format/way.
> > > > > > I find that the sidecar approach is too inefficient and I am not
> > > > > > proposing it (but I wanted to add this reference for the benefit
> of
> > > > > > new people on the list).
> > > > > >
> > > > > > I wonder if it would be possible to keep compatibility with the
> > > > > > current Prometheus based metrics.
> > > > > > Now Pulsar has reached a point in which it is widely used by many
> > > > > > companies and also with big clusters,
> > > > > > telling people that they have to rework all the infrastructure
> > > related
> > > > > > to metrics because we don't support Prometheus anymore or because
> > we
> > > > > > changed radically the way we publish metrics
> > > > > > It is a step that seems too hard from my point of view.
> > > > > >
> > > > > > Currently I believe that compatibility is more important than
> > > > > > versatility, and if we want to introduce new (and far better)
> > > features
> > > > > > we must take it into account.
> > > > > >
> > > > > > So my point is that I generally support the idea of opening the
> way
> > > to
> > > > > > Open Telemetry, but we must have a way to not force all of our
> > users
> > > > > > to throw away their alerting systems, dashboards and know-how in
> > > > > > troubleshooting Pulsar problems in production and dev
> > > > > >
> > > > > > Best regards
> > > > > > Enrico
> > > > > >
> > > > > > On Mon, May 15, 2023 at 02:17 Dave Fisher
> > > > > > <wave4d...@comcast.net> wrote:
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > > On May 10, 2023, at 1:01 AM, Asaf Mesika <
> > asaf.mes...@gmail.com>
> > > > > > wrote:
> > > > > > > >
> > > > > > > > On Tue, May 9, 2023 at 11:29 PM Dave Fisher <
> w...@apache.org>
> > > > > wrote:
> > > > > > > >
> > > > > > > >>
> > > > > > > >>
> > > > > > > >>>> On May 8, 2023, at 2:49 AM, Asaf Mesika <
> > > asaf.mes...@gmail.com>
> > > > > > wrote:
> > > > > > > >>>
> > > > > > > > >>> Your feedback made me realize I need to add a "TL;DR"
> section,
> > > > > which I
> > > > > > > >> just
> > > > > > > >>> added.
> > > > > > > >>>
> > > > > > > >>> I'm quoting it here. It gives a brief summary of the
> > proposal,
> > > > > which
> > > > > > > >>> requires up to 5 min of read time, helping you get a high
> > level
> > > > > > picture
> > > > > > > >>> before you dive into the background/motivation/solution.
> > > > > > > >>>
> > > > > > > >>> ----------------------
> > > > > > > >>> TL;DR
> > > > > > > >>>
> > > > > > > >>> Working with Metrics today as a user or a developer is hard
> > and
> > > > has
> > > > > > many
> > > > > > > >>> severe issues.
> > > > > > > >>>
> > > > > > > >>> From the user perspective:
> > > > > > > >>>
> > > > > > > > >>>  - One of Pulsar's strongest features is "cheap" topics so you
> > can
> > > > > > easily
> > > > > > > >>>  have 10k - 100k topics per broker. Once you do that, you
> > > quickly
> > > > > > learn
> > > > > > > >> that
> > > > > > > >>>  the amount of metrics you export via "/metrics"
> (Prometheus
> > > > style
> > > > > > > >> endpoint)
> > > > > > > >>>  becomes really big. The cost to store them becomes too
> high,
> > > > > queries
> > > > > > > > >>>  time-out or even the "/metrics" endpoint itself times out.
> > > > > > > >>>  The only option Pulsar gives you today is all-or-nothing
> > > > filtering
> > > > > > and
> > > > > > > >>>  very crude aggregation. You switch metrics from topic
> > > > aggregation
> > > > > > > >> level to
> > > > > > > >>>  namespace aggregation level. Also you can turn off
> producer
> > > and
> > > > > > > >> consumer
> > > > > > > >>>  level metrics. You end up doing it all leaving you
> "blind",
> > > > > looking
> > > > > > at
> > > > > > > >> the
> > > > > > > >>>  metrics from a namespace level which is too high level.
> You
> > > end
> > > > up
> > > > > > > >>>  conjuring all kinds of scripts on top of topic stats
> > endpoint
> > > to
> > > > > > glue
> > > > > > > >> some
> > > > > > > >>>  aggregated metrics view for the topics you need.
> > > > > > > >>>  - Summaries (metric type giving you quantiles like p95)
> > which
> > > > are
> > > > > > used
> > > > > > > > >>>  in Pulsar, can't be aggregated across topics / brokers due to
> > their
> > > > > > inherent
> > > > > > > >>>  design.
> > > > > > > >>>  - Plugin authors spend too much time on defining and
> > exposing
> > > > > > metrics
> > > > > > > >> to
> > > > > > > >>>  Pulsar since the only interface Pulsar offers is writing
> > your
> > > > > > metrics
> > > > > > > >> by
> > > > > > > >>>  your self as UTF-8 bytes in Prometheus Text Format to byte
> > > > stream
> > > > > > > >> interface
> > > > > > > >>>  given to you.
> > > > > > > >>>  - Pulsar histograms are exported in a way that is not
> > > conformant
> > > > > > with
> > > > > > > >>>  Prometheus, which means you can't get the p95 quantile on
> > such
> > > > > > > >> histograms,
> > > > > > > >>>  making them very hard to use in day to day life.
> > > > > > > >>
> > > > > > > >> What version of DataSketches is used to produce the
> histogram?
> > > Is
> > > > it
> > > > > > still
> > > > > > > >> an old Yahoo one, or are we using an updated one from Apache
> > > > > > DataSketches?
> > > > > > > >>
> > > > > > > >> Seems like this is a single PR/small PIP for 3.1?
> > > > > > > >
> > > > > > > >
> > > > > > > > Histograms are a list of buckets, each is a counter.
> > > > > > > > Summary is a collection of values collected over a time
> window,
> > > > which
> > > > > > at
> > > > > > > > the end you get a calculation of the quantiles of those
> values:
> > > > p95,
> > > > > > p50,
> > > > > > > > and those are exported from Pulsar.
> > > > > > > >
> > > > > > > > > Pulsar histograms do not use Data Sketches.
> > > > > > >
> > > > > > > Bookkeeper Metrics wraps Yahoo DataSketches last I checked.
> > > > > > >
> > > > > > > > They are just counters.
> > > > > > > > > They do not adhere to Prometheus since:
> > > > > > > > a. The counter is expected to be cumulative, but Pulsar
> resets
> > > each
> > > > > > bucket
> > > > > > > > counter to 0 every 1 min
> > > > > > > > b. The bucket upper range is expected to be written as an
> > > attribute
> > > > > > "le"
> > > > > > > > but today it is encoded in the name of the metric itself.
> > > > > > > >
> > > > > > > > This is a breaking change, hence hard to mark in any small
> > > release.
> > > > > > > > This is why it's part of this PIP since so many things will
> > > break,
> > > > > and
> > > > > > all
> > > > > > > > of them will break on a separate layer (OTel metrics), hence
> > not
> > > > > break
> > > > > > > > anyone without their consent.
> > > > > > >
> > > > > > > If this change will break existing Grafana dashboards and other
> > > > > > operational monitoring already in place then it will break
> > guarantees
> > > > we
> > > > > > have made about safely being able to downgrade from a bad
> upgrade.
> > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > >>
> > > > > > > >>
> > > > > > > >>>  - Too many metrics are rates which also delta reset every
> > > > interval
> > > > > > you
> > > > > > > >>>  configure in Pulsar and restart, instead of relying on
> > > > cumulative
> > > > > > (ever
> > > > > > > >>>  growing) counters and let Prometheus use its rate
> function.
> > > > > > > >>>  - and many more issues
> > > > > > > >>>
> > > > > > > >>> From the developer perspective:
> > > > > > > >>>
> > > > > > > >>>  - There are 4 different ways to define and record metrics
> in
> > > > > Pulsar:
> > > > > > > >>>  Pulsar own metrics library, Prometheus Java Client,
> > Bookkeeper
> > > > > > metrics
> > > > > > > >>>  library and plain native Java SDK objects (AtomicLong,
> ...).
> > > > It's
> > > > > > very
> > > > > > > > >>>  confusing for the developer and creates inconsistencies for
> > the
> > > > end
> > > > > > user
> > > > > > > > >>>  (e.g. Summary is different in each).
> > > > > > > >>>  - Patching your metrics into "/metrics" Prometheus
> endpoint
> > is
> > > > > > > >>>  confusing, cumbersome and error prone.
> > > > > > > >>>  - many more
> > > > > > > >>>
> > > > > > > >>> This proposal offers several key changes to solve that:
> > > > > > > >>>
> > > > > > > >>>  - Cardinality (supporting 10k-100k topics per broker) is
> > > solved
> > > > by
> > > > > > > >>>  introducing a new aggregation level for metrics called
> Topic
> > > > > Metric
> > > > > > > >> Group.
> > > > > > > >>>  Using configuration, you specify for each topic its group
> > > (using
> > > > > > > >>>  wildcard/regex). This allows you to "zoom" out to a more
> > > > detailed
> > > > > > > >>>  granularity level like groups instead of namespaces, which
> > you
> > > > > > control
> > > > > > > >> how
> > > > > > > >>>  many groups you'll have hence solving the cardinality
> issue,
> > > > > without
> > > > > > > >>>  sacrificing level of detail too much.
> > > > > > > >>>  - Fine-grained filtering mechanism, dynamic. You'll have
> > > > > rule-based
> > > > > > > >>>  dynamic configuration, allowing you to specify per
> > > > > > > >> namespace/topic/group
> > > > > > > > >>>  which metrics you'd like to keep/drop. Rules allow you to
> > set
> > > > the
> > > > > > > >> default
> > > > > > > >>>  to have small amount of metrics in group and namespace
> level
> > > > only
> > > > > > and
> > > > > > > >> drop
> > > > > > > >>>  the rest. When needed, you can add an override rule to
> > "open"
> > > > up a
> > > > > > > >> certain
> > > > > > > >>>  group to have more metrics in higher granularity (topic or
> > > even
> > > > > > > >>>  consumer/producer level). Since it's dynamic you "open"
> > such a
> > > > > group
> > > > > > > >> when
> > > > > > > >>>  you see it's misbehaving, see it in topic level, and when
> > all
> > > > > > > >> resolved, you
> > > > > > > >>>  can "close" it. A bit similar experience to logging levels
> > in
> > > > > Log4j
> > > > > > or
> > > > > > > >>>  Logback, that you default and override per class/package.
> > > > > > > >>>
> > > > > > > > >>> Aggregation and Filtering combined solve the cardinality
> > > without
> > > > > > > >>> sacrificing the level of detail when needed and most
> > > importantly,
> > > > > you
> > > > > > > >>> determine which topic/group/namespace it happens on.
> > > > > > > >>>
> > > > > > > >>> Since this change is so invasive, it requires a single
> > metrics
> > > > > > library to
> > > > > > > >>> implement all of it on top of; Hence the third big change
> > point
> > > > is
> > > > > > > >>> consolidating all four ways to define and record metrics
> to a
> > > > > single
> > > > > > > >> one, a
> > > > > > > > >>> new one: OpenTelemetry Metrics (Java SDK, and also Python
> and
> > Go
> > > > for
> > > > > > the
> > > > > > > >>> Pulsar Function runners).
> > > > > > > >>> Introducing OpenTelemetry (OTel) solves also the biggest
> pain
> > > > point
> > > > > > from
> > > > > > > >>> the developer perspective, since it's a superb metrics
> > library
> > > > > > offering
> > > > > > > >>> everything you need, and there is going to be a single way
> -
> > > only
> > > > > it.
> > > > > > > >> Also,
> > > > > > > > >>> it solves the robustness problem for plugin authors, who will use
> > > > > > OpenTelemetry.
> > > > > > > >> It
> > > > > > > >>> so happens that it also solves all the numerous problems
> > > > described
> > > > > > in the
> > > > > > > >>> doc itself.
> > > > > > > >>>
> > > > > > > >>> The solution will be introduced as another layer with
> feature
> > > > > > toggles, so
> > > > > > > >>> you can work with existing system, and/or OTel, until
> > gradually
> > > > > > > >> deprecating
> > > > > > > >>> existing system.
> > > > > > > >>>
> > > > > > > >>> It's a big breaking change for Pulsar users on many fronts:
> > > > names,
> > > > > > > >>> semantics, configuration. Read at the end of this doc to
> > learn
> > > > > > exactly
> > > > > > > >> what
> > > > > > > >>> will change for the user (in high level).
> > > > > > > >>>
> > > > > > > >>> In my opinion, it will make Pulsar user experience so much
> > > > better,
> > > > > > they
> > > > > > > >>> will want to migrate to it, despite the breaking change.
> > > > > > > >>>
> > > > > > > > >>> This was a very short summary. You are most welcome to
> read
> > > the
> > > > > full
> > > > > > > >>> design document below and express feedback, so we can make
> it
> > > > > better.
> > > > > > > >>>
> > > > > > > >>> On Sun, May 7, 2023 at 7:52 PM Asaf Mesika <
> > > > asaf.mes...@gmail.com>
> > > > > > > >> wrote:
> > > > > > > >>>
> > > > > > > >>>>
> > > > > > > >>>>
> > > > > > > >>>> On Sun, May 7, 2023 at 4:23 PM Yunze Xu
> > > > > > <y...@streamnative.io.invalid>
> > > > > > > >>>> wrote:
> > > > > > > >>>>
> > > > > > > >>>>> I'm excited to learn much more about metrics when I
> started
> > > > > reading
> > > > > > > >>>>> this proposal. But I became more and more frustrated
> when I
> > > > found
> > > > > > > >>>>> there is still too much content left even if I've already
> > > spent
> > > > > > much
> > > > > > > >>>>> time reading this proposal. I'm wondering how much time
> did
> > > you
> > > > > > expect
> > > > > > > >>>>> reviewers to read through this proposal? I just recalled
> > the
> > > > > > > >>>>> discussion you started before [1]. Did you expect each
> PMC
> > > > member
> > > > > > that
> > > > > > > >>>>> gives his/her +1 to read only parts of this proposal?
> > > > > > > >>>>>
> > > > > > > >>>>
> > > > > > > >>>> I estimated around 2 hours needed for a reviewer.
> > > > > > > >>>> I hate it being so long, but I simply couldn't find a way
> to
> > > > > > downsize it
> > > > > > > >>>> more. Furthermore, I consulted with my colleagues
> including
> > > > > Matteo,
> > > > > > but
> > > > > > > >> we
> > > > > > > >>>> couldn't see a way to scope it down.
> > > > > > > >>>> Why? Because once you begin this journey, you need to know
> > how
> > > > > it's
> > > > > > > >> going
> > > > > > > >>>> to end.
> > > > > > > >>>> What I ended up doing, is writing all the crucial details
> > for
> > > > > > review in
> > > > > > > >>>> the High Level Design section.
> > > > > > > >>>> It's still a big, hefty section, but I don't think I can
> > step
> > > > out
> > > > > > or let
> > > > > > > >>>> anyone else change Pulsar so invasively without the full
> > > extent
> > > > of
> > > > > > the
> > > > > > > >>>> change.
> > > > > > > >>>>
> > > > > > > >>>> I don't think it's wise to read parts.
> > > > > > > >>>> I did my very best effort to minimize it, but the scope is
> > > > simply
> > > > > > big.
> > > > > > > >>>> Open for suggestions, but it requires reading all the PIP
> :)
> > > > > > > >>>>
> > > > > > > >>>> Thanks a lot Yunze for dedicating any time to it.
> > > > > > > >>>>
> > > > > > > >>>>
> > > > > > > >>>>
> > > > > > > >>>>
> > > > > > > >>>>>
> > > > > > > >>>>> Let's talk back to the proposal, for now, what I mainly
> > > learned
> > > > > and
> > > > > > > >>>>> are concerned about mostly are:
> > > > > > > >>>>> 1. Pulsar has many ways to expose metrics. It's not
> unified
> > > and
> > > > > > > >> confusing.
> > > > > > > >>>>> 2. The current metrics system cannot support a large
> amount
> > > of
> > > > > > topics.
> > > > > > > >>>>> 3. It's hard for plugin authors to integrate metrics.
> (For
> > > > > example,
> > > > > > > >>>>> KoP [2] integrates metrics by implementing the
> > > > > > > >>>>> PrometheusRawMetricsProvider interface and it indeed
> needs
> > > much
> > > > > > work)
> > > > > > > >>>>>
> > > > > > > >>>>> Regarding the 1st issue, this proposal chooses
> > OpenTelemetry
> > > > > > (OTel).
> > > > > > > >>>>>
> > > > > > > >>>>> Regarding the 2nd issue, I scrolled to the "Why
> > > OpenTelemetry?"
> > > > > > > >>>>> section. It's still frustrating to see no answer.
> > > Eventually, I
> > > > > > found
> > > > > > > >>>>>
> > > > > > > >>>>
> > > > > > > > >>>> OpenTelemetry isn't the solution for a large number of
> topics.
> > > > > > > > >>>> The solution is described in the
> > > > > > > >>>> "Aggregate and Filtering to solve cardinality issues"
> > section.
> > > > > > > >>>>
> > > > > > > >>>>
> > > > > > > >>>>
> > > > > > > >>>>> the explanation in the "What we need to fix in
> > OpenTelemetry
> > > -
> > > > > > > >>>>> Performance" section. It seems that we still need some
> > > > > > enhancements in
> > > > > > > >>>>> OTel. In other words, currently OTel is not ready for
> > > resolving
> > > > > all
> > > > > > > >>>>> these issues listed in the proposal but we believe it
> will.
> > > > > > > >>>>>
> > > > > > > >>>>
> > > > > > > >>>> Let me rephrase "believe" --> we work together with the
> > > > > maintainers
> > > > > > to
> > > > > > > >> do
> > > > > > > >>>> it, yes.
> > > > > > > >>>> I am open for any other suggestion.
> > > > > > > >>>>
> > > > > > > >>>>
> > > > > > > >>>>
> > > > > > > >>>>>
> > > > > > > >>>>> As for the 3rd issue, from the "Integrating with Pulsar
> > > > Plugins"
> > > > > > > >>>>> section, the plugin authors still need to implement the
> new
> > > > OTel
> > > > > > > >>>>> interfaces. Is it much easier than using the existing
> ways
> > to
> > > > > > expose
> > > > > > > >>>>> metrics? Could metrics still be easily integrated with
> > > Grafana?
> > > > > > > >>>>>
> > > > > > > >>>>
> > > > > > > >>>> Yes, it's way easier.
> > > > > > > > >>>> Basically you have full-fledged metrics library objects:
> > > > Meter,
> > > > > > Gauge,
> > > > > > > >>>> Histogram, Counter.
> > > > > > > >>>> No more Raw Metrics Provider, writing UTF-8 bytes in
> > > Prometheus
> > > > > > format.
> > > > > > > >>>> You get namespacing for free with Meter name and version.
> > > > > > > >>>> It's way better than current solution and any other
> library.
> > > > > > > >>>>
> > > > > > > >>>>
> > > > > > > >>>>>
> > > > > > > >>>>> That's all I am concerned about at the moment. I
> > understand,
> > > > and
> > > > > > > >>>>> appreciate that you've spent much time studying and
> > > explaining
> > > > > all
> > > > > > > >>>>> these things. But, this proposal is still too huge.
> > > > > > > >>>>>
> > > > > > > >>>>
> > > > > > > >>>> I appreciate your effort a lot!
> > > > > > > >>>>
> > > > > > > >>>>
> > > > > > > >>>>
> > > > > > > >>>>>
> > > > > > > >>>>> [1]
> > > > > > https://lists.apache.org/thread/04jxqskcwwzdyfghkv4zstxxmzn154kf
> > > > > > > >>>>> [2]
> > > > > > > >>>>>
> > > > > > > >>
> > > > > >
> > > > >
> > > >
> > >
> >
> https://github.com/streamnative/kop/blob/master/kafka-impl/src/main/java/io/streamnative/pulsar/handlers/kop/stats/PrometheusMetricsProvider.java
> > > > > > > >>>>>
> > > > > > > >>>>> Thanks,
> > > > > > > >>>>> Yunze
> > > > > > > >>>>>
> > > > > > > >>>>> On Sun, May 7, 2023 at 5:53 PM Asaf Mesika <
> > > > > asaf.mes...@gmail.com>
> > > > > > > >> wrote:
> > > > > > > >>>>>>
> > > > > > > >>>>>> I'm very appreciative for feedback from multiple pulsar
> > > users
> > > > > and
> > > > > > devs
> > > > > > > >>>>> on
> > > > > > > >>>>>> this PIP, since it has dramatic changes suggested and
> > quite
> > > > > > extensive
> > > > > > > >>>>>> positive change for the users.
> > > > > > > >>>>>>
> > > > > > > >>>>>>
> > > > > > > >>>>>> On Thu, Apr 27, 2023 at 7:32 PM Asaf Mesika <
> > > > > > asaf.mes...@gmail.com>
> > > > > > > >>>>> wrote:
> > > > > > > >>>>>>
> > > > > > > >>>>>>> Hi all,
> > > > > > > >>>>>>>
> > > > > > > >>>>>>> I'm very excited to release a PIP I've been working on
> in
> > > the
> > > > > > past 11
> > > > > > > >>>>>>> months, which I think will be immensely valuable to
> > Pulsar,
> > > > > > which I
> > > > > > > >>>>> like so
> > > > > > > >>>>>>> much.
> > > > > > > >>>>>>>
> > > > > > > >>>>>>> PIP: https://github.com/apache/pulsar/issues/20197
> > > > > > > >>>>>>>
> > > > > > > >>>>>>> I'm quoting here the preface:
> > > > > > > >>>>>>>
> > > > > > > >>>>>>> === QUOTE START ===
> > > > > > > >>>>>>>
> > > > > > > >>>>>>> Roughly 11 months ago, I started working on solving the
> > > > biggest
> > > > > > issue
> > > > > > > >>>>> with
> > > > > > > >>>>>>> Pulsar metrics: the lack of ability to monitor a pulsar
> > > > broker
> > > > > > with a
> > > > > > > >>>>> large
> > > > > > > >>>>>>> topic count: 10k, 100k, and future support of 1M. This
> > > > started
> > > > > by
> > > > > > > >>>>> mapping
> > > > > > > >>>>>>> the existing functionality and then enumerating all the
> > > > > problems
> > > > > > I
> > > > > > > >>>>> saw (all
> > > > > > > >>>>>>> documented in this doc
> > > > > > > >>>>>>> <
> > > > > > > >>>>>
> > > > > > > >>
> > > > > >
> > > > >
> > > >
> > >
> >
> https://docs.google.com/document/d/1vke4w1nt7EEgOvEerPEUS-Al3aqLTm9cl2wTBkKNXUA/edit?usp=sharing
> > > > > > >
> > > > > > > I thought we were going to stop using Google docs for PIPs.
> > > > > > >
> > > > > > > >>>>>>
> > > > > > > >>>>>>> ).
> > > > > > > >>>>>>>
> > > > > > > >>>>>>> This PIP is a parent PIP. It aims to gradually solve
> > (using
> > > > > > sub-PIPs)
> > > > > > > >>>>> all
> > > > > > > >>>>>>> the current metric system's problems and provide the
> > > ability
> > > > to
> > > > > > > >>>>> monitor a
> > > > > > > >>>>>>> broker with a large topic count, which is currently
> > > lacking.
> > > > > As a
> > > > > > > >>>>> parent
> > > > > > > >>>>>>> PIP, it will describe each problem and its solution at
> a
> > > high
> > > > > > level,
> > > > > > > >>>>>>> leaving fine-grained details to the sub-PIPs. The
> parent
> > > PIP
> > > > > > ensures
> > > > > > > >>>>> all
> > > > > > > > >>>>>>> solutions align and do not contradict each other.
> > > > > > > >>>>>>>
> > > > > > > >>>>>>> The basic building block to solve the monitoring
> ability
> > of
> > > > > large
> > > > > > > >>>>> topic
> > > > > > > >>>>>>> count is aggregating internally (to topic groups) and
> > > adding
> > > > > > > >>>>> fine-grained
> > > > > > > >>>>>>> filtering. We could have shoe-horned it into the
> existing
> > > > > metric
> > > > > > > >>>>> system,
> > > > > > > >>>>>>> but we thought adding that to a system already
> ingrained
> > > with
> > > > > > many
> > > > > > > >>>>> problems
> > > > > > > >>>>>>> would be wrong and hard to do gradually, as so many
> > things
> > > > will
> > > > > > > >>>>> break. This
> > > > > > > >>>>>>> is why the second-biggest design decision presented
> here
> > is
> > > > > > > >>>>> consolidating
> > > > > > > >>>>>>> all existing metric libraries into a single one -
> > > > OpenTelemetry
> > > > > > > >>>>>>> <https://opentelemetry.io/>. The parent PIP will
> explain
> > > why
> > > > > > > >>>>>>> OpenTelemetry was chosen out of existing solutions and
> > why
> > > it
> > > > > far
> > > > > > > >>>>> exceeds
> > > > > > > >>>>>>> all other options. I’ve been working closely with the
> > > > > > OpenTelemetry
> > > > > > > >>>>>>> community in the past eight months: brain-storming this
> > > > > > integration,
> > > > > > > >>>>> and
> > > > > > > >>>>>>> raising issues, in an effort to remove serious blockers
> > to
> > > > make
> > > > > > this
> > > > > > > >>>>>>> migration successful.
> > > > > > > >>>>>>>
> > > > > > > >>>>>>> I made every effort to summarize this document so that
> it
> > > can
> > > > > be
> > > > > > > >>>>> concise
> > > > > > > >>>>>>> yet clear. I understand it is an effort to read it and,
> > > more
> > > > > so,
> > > > > > > >>>>> provide
> > > > > > > >>>>>>> meaningful feedback on such a large document; hence I’m
> > > very
> > > > > > grateful
> > > > > > > >>>>> for
> > > > > > > >>>>>>> each individual who does so.
> > > > > > > >>>>>>>
> > > > > > > >>>>>>> I think this design will help improve the user
> experience
> > > > > > immensely,
> > > > > > > >>>>> so it
> > > > > > > >>>>>>> is worth the time spent reading it.
> > > > > > > >>>>>>>
> > > > > > > >>>>>>>
> > > > > > > >>>>>>> === QUOTE END ===
> > > > > > > >>>>>>>
> > > > > > > >>>>>>>
> > > > > > > >>>>>>> Thanks!
> > > > > > > >>>>>>>
> > > > > > > >>>>>>> Asaf Mesika
> > > > > > > >>>>>>>
> > > > > > > >>>>>
> > > > > > > >>>>
> > > > > > > >>
> > > > > > > >>
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>
