Thanks for taking the time to review the document - *highly appreciated*.
I'm inlined my comments below.


On Mon, Aug 21, 2023 at 12:19 PM Hang Chen <chenh...@apache.org> wrote:

> Hi Asaf,
>     Thanks for bring up the great proposal.
>
> After reading this proposal, I have the following questions.
> 1. This proposal will introduce a big break change in Pulsar,
> especially in code perspective. I’m interested in how to support both
> old and new implementation at the same time step by step?
>
> >We will keep the current metric system as is, and add a new layer of
> metrics using OpenTelemetry Java SDK. All of Pulsar’s metrics will be
> create also using OpenTelemetry. A feature flag will allow enabling
> OpenTelemetry metrics (init, recording and exporting). All the features and
> changes described here will be done only in the OpenTelemetry layer,
> allowing to keep the old version working until you’re ready to switch using
> the OTel (OpenTelemetry) implementation. In the far future, once OTel usage
> has stabilized and became widely adopted we’ll deprecate current metric
> system and eventually remove it. We will also make sure there is feature
> flag to turn off current Prometheus based metric system.
>
>
Current metrics code remains as is, untouched.
I'm adding new code, using OpenTelemetry API and SDK. The code in most
cases will read the existing variables (like msgsReceived), and in other
cases will setup its own new objects like Counter, Histogram and *also*
record values to them.
You can take a look at the revised PIP as I've added tiny code sample to be
use as an idea how it will look like. Look here
<https://github.com/apache/pulsar/blob/6ec0bde4127a54ab8e8bb67fb091c932fa2952a4/pip/pip-264.md#consolidating-to-opentelemetry>
.



> 2. We introduced Group and filter logic in the metric system, and I
> have the following concerns.
> - We need to add protection logic or pre-validation for the group and
> filter rules to avoid users mis-configured causes huge performance
> impaction on Pulsar brokers
>
>
Good call. I've added a note in the PIP, that we will reject any filter
rules update if the expected number of data points exceed certain
threshold. I left this as detail to be specified in the sub-PIP.

- We need to support expose all the topic-level metrics when the
> Pulsar cluster just has thounds of topics
>
> I've added a new goal: "- New system should support the maximum supported
number of topics in current system (i.e. 4k topics) without filtering"


> - Even though we introduced group and filter for the metrics, we still
> can’t resolve large number of metrics exposed to Prometheus. Exposing
> large a mount of data (100MB+) throughput HTTP endpoint in
> ineffective. We can consider expose those metric data by Pulsar topic
> and develop a Pulsar to Prometheus connector to write Pulsar metric
> data to Prometheus in streaming mode instead of batch mode to reduce
> the performace impaction
>

As I wrote in my PIP. If you find your self exporting 100MB or more of
metric data *every* 30 seconds you will suffer from:
* High cost of TSDB holding that (e.g. Prometheus, Cortex, VictoriaMetrics)
* Query time out since there is too much data to read

Also, the bottleneck is not transfer time over the wire. It's mostly the
memory needed by any TSDB to hold it for at least 2 hours before flushing
it to disk - this it the most expensive of all.

At 100mb response size, filtering and grouping are a must.



>
> - Group and filter logic uses regular expressions extensively in
> rules. Regular expression parsing and matching are CPU and time
> intensive operations. We have push-down filter to reduce the generated
> metrics number, but still can’t solve the regular expression matching
> issues. If the user provide a complex regular expression for group and
> filter rule, the metric generating thread will be the bottleneck and
> will block other threads if we use synchronous call.
>
>
I plan to use caching as wrote in the PIP. Roughly (instrument, attributes)
-> boolean. It's basically as if we are adding one boolean to
PersistentTopic class - it has so many properties and size added is
negligible.


> - Group and filter rule is a litter complex for users and we need to
> provide a UI or tool to help user write the correct and effective
> rules and show the new rules impaction on old rules.
>

We don't have UI for now in Pulsar. We will make sure pulsar CLI will be
convenient enough.


>
>
> Thanks,
> Hang
>
> Matteo Merli <matteo.me...@gmail.com> 于2023年6月15日周四 23:14写道:
> >
> > > Proposing a large breaking change (even if it's crucial) is the single
> > fastest way to motivate your users to migrate to a different platform. I
> > wish it wasn't the case, but it's the cold reality.
> >
> > If you read the proposal, there is no real breaking change. There will
> be a
> > switch to choose the existing metrics or the new ones. The dashboards
> will
> > be updated and provided.
> >
> > At the same time, the best sure way to motivate users to switch or not
> > adopt a platform is to stick with confusing/inconsistent APIs/Metrics.
> >
> >
> > --
> > Matteo Merli
> > <matteo.me...@gmail.com>
> >
> >
> > On Wed, Jun 14, 2023 at 6:10 PM Devin Bost <devin.b...@gmail.com> wrote:
> >
> > > > Thanks for the details, Devin. Curios - 'We' stands for which
> company?
> > >
> > > What do you mean? I was quoting Rajan when I said, "we."
> > >
> > >
> > > Devin G. Bost
> > >
> > >
> > > On Wed, Jun 14, 2023 at 10:02 AM Asaf Mesika <asaf.mes...@gmail.com>
> > > wrote:
> > >
> > > > Thanks for the details, Devin. Curios - 'We' stands for which
> company?
> > > >
> > > > Can you take a look at my previous response to see if it answers the
> > > > concern you raised?
> > > >
> > > > Thanks!
> > > >
> > > >
> > > > On Wed, Jun 14, 2023 at 1:49 PM Devin Bost <devin.b...@gmail.com>
> wrote:
> > > >
> > > > > > Hi,
> > > > > >
> > > > > > Are we proposing a change to break existing metrics compatibility
> > > > > > (prometheus)? If that is the case then it's a big red flag as it
> will
> > > > be
> > > > > a
> > > > > > pain for any company to upgrade Pulsar as monitoring is THE most
> > > > > important
> > > > > > part of the system and we don't even want to break compatibility
> for
> > > > any
> > > > > > small things to avoid interruption for users that are using
> Pulsar
> > > > > system.
> > > > > > I think it's always good to enhance a system by maintaining
> > > > compatibility
> > > > > > and I would be fine if we can introduce new metrics API without
> > > causing
> > > > > ANY
> > > > > > interruption to existing metrics API. But if we can't maintain
> > > > > > compatibility then it's a big red flag and not acceptable for the
> > > > Pulsar
> > > > > > community.
> > > > >
> > > > > Proposing a large breaking change (even if it's crucial) is the
> single
> > > > > fastest way to motivate your users to migrate to a different
> platform.
> > > I
> > > > > wish it wasn't the case, but it's the cold reality.
> > > > >
> > > > > With that said, I'm a big proponent of Open Telemetry. I did a big
> > > video
> > > > a
> > > > > while back that some of you may remember on the use of Open Tracing
> > > > (before
> > > > > it was merged into Open Telemetry). Open Telemetry has gained
> > > > considerable
> > > > > momentum in the industry since then.
> > > > >
> > > > > I'm also very interested in a solution to the metrics problem.
> I've run
> > > > > into the scalability issues with metrics in production, and I've
> been
> > > > very
> > > > > concerned about the metrics bottlenecks around our ability to
> deliver
> > > our
> > > > > promises around supporting large numbers of topics. One of the big
> > > > > advantages of Pulsar over Kafka is supposed to be that topics are
> > > cheap,
> > > > > but as it stands, our current metrics design gets seriously in the
> way
> > > of
> > > > > that. Generally speaking, I'm open to solutions, especially if they
> > > align
> > > > > us with a growing industry standard.
> > > > >
> > > > > - Devin
> > > > >
> > > > >
> > > > > On Wed, Jun 14, 2023, 3:28 AM Enrico Olivelli <eolive...@gmail.com
> >
> > > > wrote:
> > > > >
> > > > > > Il Mer 14 Giu 2023, 04:33 Rajan Dhabalia <rdhaba...@apache.org>
> ha
> > > > > > scritto:
> > > > > >
> > > > > > > Hi,
> > > > > > >
> > > > > > > Are we proposing a change to break existing metrics
> compatibility
> > > > > > > (prometheus)? If that is the case then it's a big red flag as
> it
> > > will
> > > > > be
> > > > > > a
> > > > > > > pain for any company to upgrade Pulsar as monitoring is THE
> most
> > > > > > important
> > > > > > > part of the system and we don't even want to break
> compatibility
> > > for
> > > > > any
> > > > > > > small things to avoid interruption for users that are using
> Pulsar
> > > > > > system.
> > > > > > > I think it's always good to enhance a system by maintaining
> > > > > compatibility
> > > > > > > and I would be fine if we can introduce new metrics API without
> > > > causing
> > > > > > ANY
> > > > > > > interruption to existing metrics API. But if we can't maintain
> > > > > > > compatibility then it's a big red flag and not acceptable for
> the
> > > > > Pulsar
> > > > > > > community.
> > > > > > >
> > > > > >
> > > > > > I agree.
> > > > > >
> > > > > > If it is possible to export data Ina way that is compatible with
> > > > > Prometheus
> > > > > > without adding too much overhead then I would support this work.
> > > > > >
> > > > > > About renaming the metrics: we can do it only if tue changes for
> > > users
> > > > > are
> > > > > > as trivial as replacing the queries in the grafana dashboard or
> in
> > > > > alerting
> > > > > > systems.
> > > > > >
> > > > > > Asaf, do you have prototype? Built over any version of Pulsar?
> > > > > >
> > > > > > Also, it would be very useful to start an initiative to collect
> the
> > > > list
> > > > > of
> > > > > > metrics that people really use in production, especially for
> > > automated
> > > > > > alerts.
> > > > > >
> > > > > > In my experience you usually care about:
> > > > > > - in/out traffic (rates, bytes...)
> > > > > > - number of producer, consumers, topics, subscriptions...
> > > > > > - backlog
> > > > > > - jvm metrics
> > > > > > - function custom metrics
> > > > > >
> > > > > >
> > > > > > Enrico
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > > > Thanks,
> > > > > > > Rajan
> > > > > > >
> > > > > > > On Sun, May 21, 2023 at 9:01 AM Asaf Mesika <
> asaf.mes...@gmail.com
> > > >
> > > > > > wrote:
> > > > > > >
> > > > > > > > Thanks for the reply, Enrico.
> > > > > > > > Completely agree.
> > > > > > > > This made me realize my TL;DR wasn't talking about export.
> > > > > > > > I added this to it:
> > > > > > > >
> > > > > > > > ---
> > > > > > > > Pulsar OTel Metrics will support exporting as Prometheus HTTP
> > > > > endpoint
> > > > > > > > (`/metrics` but different port) for backward compatibility
> and
> > > also
> > > > > > OLTP,
> > > > > > > > so you can push the metrics to OTel Collector and from there
> ship
> > > > it
> > > > > to
> > > > > > > any
> > > > > > > > destination.
> > > > > > > > ---
> > > > > > > >
> > > > > > > > OTel supports two kinds of exporter: Prometheus (HTTP) and
> OTLP
> > > > > (push).
> > > > > > > > We'll just configure to use them.
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > On Mon, May 15, 2023 at 10:35 AM Enrico Olivelli <
> > > > > eolive...@gmail.com>
> > > > > > > > wrote:
> > > > > > > >
> > > > > > > > > Asaf,
> > > > > > > > > thanks for contributing in this area.
> > > > > > > > > Metrics are a fundamental feature of Pulsar.
> > > > > > > > >
> > > > > > > > > Currently I find it very awkward to maintain metrics, and
> also
> > > I
> > > > > see
> > > > > > > > > it as a problem to support only Prometheus.
> > > > > > > > >
> > > > > > > > > Regarding your proposal, IIRC in the past someone else
> proposed
> > > > to
> > > > > > > > > support other metrics systems and they have been suggested
> to
> > > > use a
> > > > > > > > > sidecar approach,
> > > > > > > > > that is to add something next to Pulsar services that
> served
> > > the
> > > > > > > > > metrics in the preferred format/way.
> > > > > > > > > I find that the sidecar approach is too inefficient and I
> am
> > > not
> > > > > > > > > proposing it (but I wanted to add this reference for the
> > > benefit
> > > > of
> > > > > > > > > new people on the list).
> > > > > > > > >
> > > > > > > > > I wonder if it would be possible to keep compatibility
> with the
> > > > > > > > > current Prometheus based metrics.
> > > > > > > > > Now Pulsar reached a point in which is is widely used by
> many
> > > > > > > > > companies and also with big clusters,
> > > > > > > > > telling people that they have to rework all the
> infrastructure
> > > > > > related
> > > > > > > > > to metrics because we don't support Prometheus anymore or
> > > because
> > > > > we
> > > > > > > > > changed radically the way we publish metrics
> > > > > > > > > It is a step that seems too hard from my point of view.
> > > > > > > > >
> > > > > > > > > Currently I believe that compatibility is more important
> than
> > > > > > > > > versatility, and if we want to introduce new (and far
> better)
> > > > > > features
> > > > > > > > > we must take it into account.
> > > > > > > > >
> > > > > > > > > So my point is that I generally support the idea of
> opening the
> > > > way
> > > > > > to
> > > > > > > > > Open Telemetry, but we must have a way to not force all of
> our
> > > > > users
> > > > > > > > > to throw away their alerting systems, dashboards and
> know-how
> > > in
> > > > > > > > > troubleshooting Pulsar problems in production and dev
> > > > > > > > >
> > > > > > > > > Best regards
> > > > > > > > > Enrico
> > > > > > > > >
> > > > > > > > > Il giorno lun 15 mag 2023 alle ore 02:17 Dave Fisher
> > > > > > > > > <wave4d...@comcast.net> ha scritto:
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > > On May 10, 2023, at 1:01 AM, Asaf Mesika <
> > > > > asaf.mes...@gmail.com>
> > > > > > > > > wrote:
> > > > > > > > > > >
> > > > > > > > > > > On Tue, May 9, 2023 at 11:29 PM Dave Fisher <
> > > > w...@apache.org>
> > > > > > > > wrote:
> > > > > > > > > > >
> > > > > > > > > > >>
> > > > > > > > > > >>
> > > > > > > > > > >>>> On May 8, 2023, at 2:49 AM, Asaf Mesika <
> > > > > > asaf.mes...@gmail.com>
> > > > > > > > > wrote:
> > > > > > > > > > >>>
> > > > > > > > > > >>> Your feedback made me realized I need to add "TL;DR"
> > > > section,
> > > > > > > > which I
> > > > > > > > > > >> just
> > > > > > > > > > >>> added.
> > > > > > > > > > >>>
> > > > > > > > > > >>> I'm quoting it here. It gives a brief summary of the
> > > > > proposal,
> > > > > > > > which
> > > > > > > > > > >>> requires up to 5 min of read time, helping you get a
> high
> > > > > level
> > > > > > > > > picture
> > > > > > > > > > >>> before you dive into the
> background/motivation/solution.
> > > > > > > > > > >>>
> > > > > > > > > > >>> ----------------------
> > > > > > > > > > >>> TL;DR
> > > > > > > > > > >>>
> > > > > > > > > > >>> Working with Metrics today as a user or a developer
> is
> > > hard
> > > > > and
> > > > > > > has
> > > > > > > > > many
> > > > > > > > > > >>> severe issues.
> > > > > > > > > > >>>
> > > > > > > > > > >>> From the user perspective:
> > > > > > > > > > >>>
> > > > > > > > > > >>>  - One of Pulsar strongest feature is "cheap" topics
> so
> > > you
> > > > > can
> > > > > > > > > easily
> > > > > > > > > > >>>  have 10k - 100k topics per broker. Once you do
> that, you
> > > > > > quickly
> > > > > > > > > learn
> > > > > > > > > > >> that
> > > > > > > > > > >>>  the amount of metrics you export via "/metrics"
> > > > (Prometheus
> > > > > > > style
> > > > > > > > > > >> endpoint)
> > > > > > > > > > >>>  becomes really big. The cost to store them becomes
> too
> > > > high,
> > > > > > > > queries
> > > > > > > > > > >>>  time-out or even "/metrics" endpoint it self times
> out.
> > > > > > > > > > >>>  The only option Pulsar gives you today is
> all-or-nothing
> > > > > > > filtering
> > > > > > > > > and
> > > > > > > > > > >>>  very crude aggregation. You switch metrics from
> topic
> > > > > > > aggregation
> > > > > > > > > > >> level to
> > > > > > > > > > >>>  namespace aggregation level. Also you can turn off
> > > > producer
> > > > > > and
> > > > > > > > > > >> consumer
> > > > > > > > > > >>>  level metrics. You end up doing it all leaving you
> > > > "blind",
> > > > > > > > looking
> > > > > > > > > at
> > > > > > > > > > >> the
> > > > > > > > > > >>>  metrics from a namespace level which is too high
> level.
> > > > You
> > > > > > end
> > > > > > > up
> > > > > > > > > > >>>  conjuring all kinds of scripts on top of topic stats
> > > > > endpoint
> > > > > > to
> > > > > > > > > glue
> > > > > > > > > > >> some
> > > > > > > > > > >>>  aggregated metrics view for the topics you need.
> > > > > > > > > > >>>  - Summaries (metric type giving you quantiles like
> p95)
> > > > > which
> > > > > > > are
> > > > > > > > > used
> > > > > > > > > > >>>  in Pulsar, can't be aggregated across topics /
> brokers
> > > due
> > > > > its
> > > > > > > > > inherent
> > > > > > > > > > >>>  design.
> > > > > > > > > > >>>  - Plugin authors spend too much time on defining and
> > > > > exposing
> > > > > > > > > metrics
> > > > > > > > > > >> to
> > > > > > > > > > >>>  Pulsar since the only interface Pulsar offers is
> writing
> > > > > your
> > > > > > > > > metrics
> > > > > > > > > > >> by
> > > > > > > > > > >>>  your self as UTF-8 bytes in Prometheus Text Format
> to
> > > byte
> > > > > > > stream
> > > > > > > > > > >> interface
> > > > > > > > > > >>>  given to you.
> > > > > > > > > > >>>  - Pulsar histograms are exported in a way that is
> not
> > > > > > conformant
> > > > > > > > > with
> > > > > > > > > > >>>  Prometheus, which means you can't get the p95
> quantile
> > > on
> > > > > such
> > > > > > > > > > >> histograms,
> > > > > > > > > > >>>  making them very hard to use in day to day life.
> > > > > > > > > > >>
> > > > > > > > > > >> What version of DataSketches is used to produce the
> > > > histogram?
> > > > > > Is
> > > > > > > is
> > > > > > > > > still
> > > > > > > > > > >> an old Yahoo one, or are we using an updated one from
> > > Apache
> > > > > > > > > DataSketches?
> > > > > > > > > > >>
> > > > > > > > > > >> Seems like this is a single PR/small PIP for 3.1?
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > Histograms are a list of buckets, each is a counter.
> > > > > > > > > > > Summary is a collection of values collected over a time
> > > > window,
> > > > > > > which
> > > > > > > > > at
> > > > > > > > > > > the end you get a calculation of the quantiles of those
> > > > values:
> > > > > > > p95,
> > > > > > > > > p50,
> > > > > > > > > > > and those are exported from Pulsar.
> > > > > > > > > > >
> > > > > > > > > > > Pulsar histogram do not use Data Sketches.
> > > > > > > > > >
> > > > > > > > > > Bookkeeper Metrics wraps Yahoo DataSketches last I
> checked.
> > > > > > > > > >
> > > > > > > > > > > They are just counters.
> > > > > > > > > > > They are not adhere to Prometheus since:
> > > > > > > > > > > a. The counter is expected to be cumulative, but Pulsar
> > > > resets
> > > > > > each
> > > > > > > > > bucket
> > > > > > > > > > > counter to 0 every 1 min
> > > > > > > > > > > b. The bucket upper range is expected to be written as
> an
> > > > > > attribute
> > > > > > > > > "le"
> > > > > > > > > > > but today it is encoded in the name of the metric
> itself.
> > > > > > > > > > >
> > > > > > > > > > > This is a breaking change, hence hard to mark in any
> small
> > > > > > release.
> > > > > > > > > > > This is why it's part of this PIP since so many things
> will
> > > > > > break,
> > > > > > > > and
> > > > > > > > > all
> > > > > > > > > > > of them will break on a separate layer (OTel metrics),
> > > hence
> > > > > not
> > > > > > > > break
> > > > > > > > > > > anyone without their consent.
> > > > > > > > > >
> > > > > > > > > > If this change will break existing Grafana dashboards and
> > > other
> > > > > > > > > operational monitoring already in place then it will break
> > > > > guarantees
> > > > > > > we
> > > > > > > > > have made about safely being able to downgrade from a bad
> > > > upgrade.
> > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > >>
> > > > > > > > > > >>
> > > > > > > > > > >>>  - Too many metrics are rates which also delta reset
> > > every
> > > > > > > interval
> > > > > > > > > you
> > > > > > > > > > >>>  configure in Pulsar and restart, instead of relying
> on
> > > > > > > cumulative
> > > > > > > > > (ever
> > > > > > > > > > >>>  growing) counters and let Prometheus use its rate
> > > > function.
> > > > > > > > > > >>>  - and many more issues
> > > > > > > > > > >>>
> > > > > > > > > > >>> From the developer perspective:
> > > > > > > > > > >>>
> > > > > > > > > > >>>  - There are 4 different ways to define and record
> > > metrics
> > > > in
> > > > > > > > Pulsar:
> > > > > > > > > > >>>  Pulsar own metrics library, Prometheus Java Client,
> > > > > Bookkeeper
> > > > > > > > > metrics
> > > > > > > > > > >>>  library and plain native Java SDK objects
> (AtomicLong,
> > > > ...).
> > > > > > > It's
> > > > > > > > > very
> > > > > > > > > > >>>  confusing for the developer and create
> inconsistencies
> > > for
> > > > > the
> > > > > > > end
> > > > > > > > > user
> > > > > > > > > > >>>  (e.g. Summary for example is different in each).
> > > > > > > > > > >>>  - Patching your metrics into "/metrics" Prometheus
> > > > endpoint
> > > > > is
> > > > > > > > > > >>>  confusing, cumbersome and error prone.
> > > > > > > > > > >>>  - many more
> > > > > > > > > > >>>
> > > > > > > > > > >>> This proposal offers several key changes to solve
> that:
> > > > > > > > > > >>>
> > > > > > > > > > >>>  - Cardinality (supporting 10k-100k topics per
> broker) is
> > > > > > solved
> > > > > > > by
> > > > > > > > > > >>>  introducing a new aggregation level for metrics
> called
> > > > Topic
> > > > > > > > Metric
> > > > > > > > > > >> Group.
> > > > > > > > > > >>>  Using configuration, you specify for each topic its
> > > group
> > > > > > (using
> > > > > > > > > > >>>  wildcard/regex). This allows you to "zoom" out to a
> more
> > > > > > > detailed
> > > > > > > > > > >>>  granularity level like groups instead of namespaces,
> > > which
> > > > > you
> > > > > > > > > control
> > > > > > > > > > >> how
> > > > > > > > > > >>>  many groups you'll have hence solving the
> cardinality
> > > > issue,
> > > > > > > > without
> > > > > > > > > > >>>  sacrificing level of detail too much.
> > > > > > > > > > >>>  - Fine-grained filtering mechanism, dynamic. You'll
> have
> > > > > > > > rule-based
> > > > > > > > > > >>>  dynamic configuration, allowing you to specify per
> > > > > > > > > > >> namespace/topic/group
> > > > > > > > > > >>>  which metrics you'd like to keep/drop. Rules allows
> you
> > > to
> > > > > set
> > > > > > > the
> > > > > > > > > > >> default
> > > > > > > > > > >>>  to have small amount of metrics in group and
> namespace
> > > > level
> > > > > > > only
> > > > > > > > > and
> > > > > > > > > > >> drop
> > > > > > > > > > >>>  the rest. When needed, you can add an override rule
> to
> > > > > "open"
> > > > > > > up a
> > > > > > > > > > >> certain
> > > > > > > > > > >>>  group to have more metrics in higher granularity
> (topic
> > > or
> > > > > > even
> > > > > > > > > > >>>  consumer/producer level). Since it's dynamic you
> "open"
> > > > > such a
> > > > > > > > group
> > > > > > > > > > >> when
> > > > > > > > > > >>>  you see it's misbehaving, see it in topic level, and
> > > when
> > > > > all
> > > > > > > > > > >> resolved, you
> > > > > > > > > > >>>  can "close" it. A bit similar experience to logging
> > > levels
> > > > > in
> > > > > > > > Log4j
> > > > > > > > > or
> > > > > > > > > > >>>  Logback, that you default and override per
> > > class/package.
> > > > > > > > > > >>>
> > > > > > > > > > >>> Aggregation and Filtering combined solves the
> cardinality
> > > > > > without
> > > > > > > > > > >>> sacrificing the level of detail when needed and most
> > > > > > importantly,
> > > > > > > > you
> > > > > > > > > > >>> determine which topic/group/namespace it happens on.
> > > > > > > > > > >>>
> > > > > > > > > > >>> Since this change is so invasive, it requires a
> single
> > > > > metrics
> > > > > > > > > library to
> > > > > > > > > > >>> implement all of it on top of; Hence the third big
> change
> > > > > point
> > > > > > > is
> > > > > > > > > > >>> consolidating all four ways to define and record
> metrics
> > > > to a
> > > > > > > > single
> > > > > > > > > > >> one, a
> > > > > > > > > > >>> new one: OpenTelemtry Metrics (Java SDK, and also
> Python
> > > > and
> > > > > Go
> > > > > > > for
> > > > > > > > > the
> > > > > > > > > > >>> Pulsar Function runners).
> > > > > > > > > > >>> Introducing OpenTelemetry (OTel) solves also the
> biggest
> > > > pain
> > > > > > > point
> > > > > > > > > from
> > > > > > > > > > >>> the developer perspective, since it's a superb
> metrics
> > > > > library
> > > > > > > > > offering
> > > > > > > > > > >>> everything you need, and there is going to be a
> single
> > > way
> > > > -
> > > > > > only
> > > > > > > > it.
> > > > > > > > > > >> Also,
> > > > > > > > > > >>> it solves the robustness for Plugin author which
> will use
> > > > > > > > > OpenTelemetry.
> > > > > > > > > > >> It
> > > > > > > > > > >>> so happens that it also solves all the numerous
> problems
> > > > > > > described
> > > > > > > > > in the
> > > > > > > > > > >>> doc itself.
> > > > > > > > > > >>>
> > > > > > > > > > >>> The solution will be introduced as another layer with
> > > > feature
> > > > > > > > > toggles, so
> > > > > > > > > > >>> you can work with existing system, and/or OTel, until
> > > > > gradually
> > > > > > > > > > >> deprecating
> > > > > > > > > > >>> existing system.
> > > > > > > > > > >>>
> > > > > > > > > > >>> It's a big breaking change for Pulsar users on many
> > > fronts:
> > > > > > > names,
> > > > > > > > > > >>> semantics, configuration. Read at the end of this
> doc to
> > > > > learn
> > > > > > > > > exactly
> > > > > > > > > > >> what
> > > > > > > > > > >>> will change for the user (in high level).
> > > > > > > > > > >>>
> > > > > > > > > > >>> In my opinion, it will make Pulsar user experience so
> > > much
> > > > > > > better,
> > > > > > > > > they
> > > > > > > > > > >>> will want to migrate to it, despite the breaking
> change.
> > > > > > > > > > >>>
> > > > > > > > > > >>> This was a very short summary. You are most welcomed
> to
> > > > read
> > > > > > the
> > > > > > > > full
> > > > > > > > > > >>> design document below and express feedback, so we can
> > > make
> > > > it
> > > > > > > > better.
> > > > > > > > > > >>>
> > > > > > > > > > >>> On Sun, May 7, 2023 at 7:52 PM Asaf Mesika <
> > > > > > > asaf.mes...@gmail.com>
> > > > > > > > > > >> wrote:
> > > > > > > > > > >>>
> > > > > > > > > > >>>>
> > > > > > > > > > >>>>
> > > > > > > > > > >>>> On Sun, May 7, 2023 at 4:23 PM Yunze Xu
> > > > > > > > > <y...@streamnative.io.invalid>
> > > > > > > > > > >>>> wrote:
> > > > > > > > > > >>>>
> > > > > > > > > > >>>>> I'm excited to learn much more about metrics when I
> > > > started
> > > > > > > > reading
> > > > > > > > > > >>>>> this proposal. But I became more and more
> frustrated
> > > > when I
> > > > > > > found
> > > > > > > > > > >>>>> there is still too much content left even if I've
> > > already
> > > > > > spent
> > > > > > > > > much
> > > > > > > > > > >>>>> time reading this proposal. I'm wondering how much
> time
> > > > did
> > > > > > you
> > > > > > > > > expect
> > > > > > > > > > >>>>> reviewers to read through this proposal? I just
> > > recalled
> > > > > the
> > > > > > > > > > >>>>> discussion you started before [1]. Did you expect
> each
> > > > PMC
> > > > > > > member
> > > > > > > > > that
> > > > > > > > > > >>>>> gives his/her +1 to read only parts of this
> proposal?
> > > > > > > > > > >>>>>
> > > > > > > > > > >>>>
> > > > > > > > > > >>>> I estimated around 2 hours needed for a reviewer.
> > > > > > > > > > >>>> I hate it being so long, but I simply couldn't find
> a
> > > way
> > > > to
> > > > > > > > > downsize it
> > > > > > > > > > >>>> more. Furthermore, I consulted with my colleagues
> > > > including
> > > > > > > > Matteo,
> > > > > > > > > but
> > > > > > > > > > >> we
> > > > > > > > > > >>>> couldn't see a way to scope it down.
> > > > > > > > > > >>>> Why? Because once you begin this journey, you need
> to
> > > know
> > > > > how
> > > > > > > > it's
> > > > > > > > > > >> going
> > > > > > > > > > >>>> to end.
> > > > > > > > > > >>>> What I ended up doing, is writing all the crucial
> > > details
> > > > > for
> > > > > > > > > review in
> > > > > > > > > > >>>> the High Level Design section.
> > > > > > > > > > >>>> It's still a big, hefty section, but I don't think
> I can
> > > > > step
> > > > > > > out
> > > > > > > > > or let
> > > > > > > > > > >>>> anyone else change Pulsar so invasively without the
> full
> > > > > > extent
> > > > > > > of
> > > > > > > > > the
> > > > > > > > > > >>>> change.
> > > > > > > > > > >>>>
> > > > > > > > > > >>>> I don't think it's wise to read parts.
> > > > > > > > > > >>>> I did my very best effort to minimize it, but the
> scope
> > > is
> > > > > > > simply
> > > > > > > > > big.
> > > > > > > > > > >>>> Open for suggestions, but it requires reading all
> the
> > > PIP
> > > > :)
> > > > > > > > > > >>>>
> > > > > > > > > > >>>> Thanks a lot Yunze for dedicating any time to it.
> > > > > > > > > > >>>>
> > > > > > > > > > >>>>
> > > > > > > > > > >>>>
> > > > > > > > > > >>>>
> > > > > > > > > > >>>>>
> > > > > > > > > > >>>>> Let's talk back to the proposal, for now, what I
> mainly
> > > > > > learned
> > > > > > > > and
> > > > > > > > > > >>>>> are concerned about mostly are:
> > > > > > > > > > >>>>> 1. Pulsar has many ways to expose metrics. It's not
> > > > unified
> > > > > > and
> > > > > > > > > > >> confusing.
> > > > > > > > > > >>>>> 2. The current metrics system cannot support a
> large
> > > > amount
> > > > > > of
> > > > > > > > > topics.
> > > > > > > > > > >>>>> 3. It's hard for plugin authors to integrate
> metrics.
> > > > (For
> > > > > > > > example,
> > > > > > > > > > >>>>> KoP [2] integrates metrics by implementing the
> > > > > > > > > > >>>>> PrometheusRawMetricsProvider interface and it
> indeed
> > > > needs
> > > > > > much
> > > > > > > > > work)
> > > > > > > > > > >>>>>
> > > > > > > > > > >>>>> Regarding the 1st issue, this proposal chooses
> > > > > OpenTelemetry
> > > > > > > > > (OTel).
> > > > > > > > > > >>>>>
> > > > > > > > > > >>>>> Regarding the 2nd issue, I scrolled to the "Why
> > > > > > OpenTelemetry?"
> > > > > > > > > > >>>>> section. It's still frustrating to see no answer.
> > > > > > Eventually, I
> > > > > > > > > found
> > > > > > > > > > >>>>>
> > > > > > > > > > >>>>
> > > > > > > > > > >>>> OpenTelemetry isn't the solution for large amount of
> > > > topic.
> > > > > > > > > > >>>> The solution is described at
> > > > > > > > > > >>>> "Aggregate and Filtering to solve cardinality
> issues"
> > > > > section.
> > > > > > > > > > >>>>
> > > > > > > > > > >>>>
> > > > > > > > > > >>>>
> > > > > > > > > > >>>>> the explanation in the "What we need to fix in
> > > > > OpenTelemetry
> > > > > > -
> > > > > > > > > > >>>>> Performance" section. It seems that we still need
> some
> > > > > > > > > enhancements in
> > > > > > > > > > >>>>> OTel. In other words, currently OTel is not ready
> for
> > > > > > resolving
> > > > > > > > all
> > > > > > > > > > >>>>> these issues listed in the proposal but we believe
> it
> > > > will.
> > > > > > > > > > >>>>>
> > > > > > > > > > >>>>
> > > > > > > > > > >>>> Let me rephrase "believe" --> we work together with
> the
> > > > > > > > maintainers
> > > > > > > > > to
> > > > > > > > > > >> do
> > > > > > > > > > >>>> it, yes.
> > > > > > > > > > >>>> I am open for any other suggestion.
> > > > > > > > > > >>>>
> > > > > > > > > > >>>>
> > > > > > > > > > >>>>
> > > > > > > > > > >>>>>
> > > > > > > > > > >>>>> As for the 3rd issue, from the "Integrating with
> Pulsar
> > > > > > > Plugins"
> > > > > > > > > > >>>>> section, the plugin authors still need to
> implement the
> > > > new
> > > > > > > OTel
> > > > > > > > > > >>>>> interfaces. Is it much easier than using the
> existing
> > > > ways
> > > > > to
> > > > > > > > > expose
> > > > > > > > > > >>>>> metrics? Could metrics still be easily integrated
> with
> > > > > > Grafana?
> > > > > > > > > > >>>>>
> > > > > > > > > > >>>>
> > > > > > > > > > >>>> Yes, it's way easier.
> > > > > > > > > > >>>> Basically you have a full fledged metrics library
> > > objects:
> > > > > > > Meter,
> > > > > > > > > Gauge,
> > > > > > > > > > >>>> Histogram, Counter.
> > > > > > > > > > >>>> No more Raw Metrics Provider, writing UTF-8 bytes in
> > > > > > Prometheus
> > > > > > > > > format.
> > > > > > > > > > >>>> You get namespacing for free with Meter name and
> > > version.
> > > > > > > > > > >>>> It's way better than current solution and any other
> > > > library.
> > > > > > > > > > >>>>
> > > > > > > > > > >>>>
> > > > > > > > > > >>>>>
> > > > > > > > > > >>>>> That's all I am concerned about at the moment. I
> > > > > understand,
> > > > > > > and
> > > > > > > > > > >>>>> appreciate that you've spent much time studying and
> > > > > > explaining
> > > > > > > > all
> > > > > > > > > > >>>>> these things. But, this proposal is still too huge.
> > > > > > > > > > >>>>>
> > > > > > > > > > >>>>
> > > > > > > > > > >>>> I appreciate your effort a lot!
> > > > > > > > > > >>>>
> > > > > > > > > > >>>>
> > > > > > > > > > >>>>
> > > > > > > > > > >>>>>
> > > > > > > > > > >>>>> [1]
> > > > > > > > >
> > > https://lists.apache.org/thread/04jxqskcwwzdyfghkv4zstxxmzn154kf
> > > > > > > > > > >>>>> [2]
> > > > > > > > > > >>>>>
> > > > > > > > > > >>
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> https://github.com/streamnative/kop/blob/master/kafka-impl/src/main/java/io/streamnative/pulsar/handlers/kop/stats/PrometheusMetricsProvider.java
> > > > > > > > > > >>>>>
> > > > > > > > > > >>>>> Thanks,
> > > > > > > > > > >>>>> Yunze
> > > > > > > > > > >>>>>
> > > > > > > > > > >>>>> On Sun, May 7, 2023 at 5:53 PM Asaf Mesika <
> > > > > > > > asaf.mes...@gmail.com>
> > > > > > > > > > >> wrote:
> > > > > > > > > > >>>>>>
> > > > > > > > > > >>>>>> I'm very appreciative for feedback from multiple
> > > pulsar
> > > > > > users
> > > > > > > > and
> > > > > > > > > devs
> > > > > > > > > > >>>>> on
> > > > > > > > > > >>>>>> this PIP, since it has dramatic changes suggested
> and
> > > > > quite
> > > > > > > > > extensive
> > > > > > > > > > >>>>>> positive change for the users.
> > > > > > > > > > >>>>>>
> > > > > > > > > > >>>>>>
> > > > > > > > > > >>>>>> On Thu, Apr 27, 2023 at 7:32 PM Asaf Mesika <
> > > > > > > > > asaf.mes...@gmail.com>
> > > > > > > > > > >>>>> wrote:
> > > > > > > > > > >>>>>>
> > > > > > > > > > >>>>>>> Hi all,
> > > > > > > > > > >>>>>>>
> > > > > > > > > > >>>>>>> I'm very excited to release a PIP I've been
> working
> > > on
> > > > in
> > > > > > the
> > > > > > > > > past 11
> > > > > > > > > > >>>>>>> months, which I think will be immensely valuable
> to
> > > > > Pulsar,
> > > > > > > > > which I
> > > > > > > > > > >>>>> like so
> > > > > > > > > > >>>>>>> much.
> > > > > > > > > > >>>>>>>
> > > > > > > > > > >>>>>>> PIP:
> https://github.com/apache/pulsar/issues/20197
> > > > > > > > > > >>>>>>>
> > > > > > > > > > >>>>>>> I'm quoting here the preface:
> > > > > > > > > > >>>>>>>
> > > > > > > > > > >>>>>>> === QUOTE START ===
> > > > > > > > > > >>>>>>>
> > > > > > > > > > >>>>>>> Roughly 11 months ago, I started working on
> solving
> > > the
> > > > > > > biggest
> > > > > > > > > issue
> > > > > > > > > > >>>>> with
> > > > > > > > > > >>>>>>> Pulsar metrics: the lack of ability to monitor a
> > > pulsar
> > > > > > > broker
> > > > > > > > > with a
> > > > > > > > > > >>>>> large
> > > > > > > > > > >>>>>>> topic count: 10k, 100k, and future support of 1M.
> > > This
> > > > > > > started
> > > > > > > > by
> > > > > > > > > > >>>>> mapping
> > > > > > > > > > >>>>>>> the existing functionality and then enumerating
> all
> > > the
> > > > > > > > problems
> > > > > > > > > I
> > > > > > > > > > >>>>> saw (all
> > > > > > > > > > >>>>>>> documented in this doc
> > > > > > > > > > >>>>>>> <
> > > > > > > > > > >>>>>
> > > > > > > > > > >>
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> https://docs.google.com/document/d/1vke4w1nt7EEgOvEerPEUS-Al3aqLTm9cl2wTBkKNXUA/edit?usp=sharing
> > > > > > > > > >
> > > > > > > > > > I thought we were going to stop using Google docs for
> PIPs.
> > > > > > > > > >
> > > > > > > > > > >>>>>>
> > > > > > > > > > >>>>>>> ).
> > > > > > > > > > >>>>>>>
> > > > > > > > > > >>>>>>> This PIP is a parent PIP. It aims to gradually
> solve
> > > > > (using
> > > > > > > > > sub-PIPs)
> > > > > > > > > > >>>>> all
> > > > > > > > > > >>>>>>> the current metric system's problems and provide
> the
> > > > > > ability
> > > > > > > to
> > > > > > > > > > >>>>> monitor a
> > > > > > > > > > >>>>>>> broker with a large topic count, which is
> currently
> > > > > > lacking.
> > > > > > > > As a
> > > > > > > > > > >>>>> parent
> > > > > > > > > > >>>>>>> PIP, it will describe each problem and its
> solution
> > > at
> > > > a
> > > > > > high
> > > > > > > > > level,
> > > > > > > > > > >>>>>>> leaving fine-grained details to the sub-PIPs. The
> > > > parent
> > > > > > PIP
> > > > > > > > > ensures
> > > > > > > > > > >>>>> all
> > > > > > > > > > >>>>>>> solutions align and does not contradict each
> other.
> > > > > > > > > > >>>>>>>
> > > > > > > > > > >>>>>>> The basic building block to solve the monitoring
> > > > ability
> > > > > of
> > > > > > > > large
> > > > > > > > > > >>>>> topic
> > > > > > > > > > >>>>>>> count is aggregating internally (to topic
> groups) and
> > > > > > adding
> > > > > > > > > > >>>>> fine-grained
> > > > > > > > > > >>>>>>> filtering. We could have shoe-horned it into the
> > > > existing
> > > > > > > > metric
> > > > > > > > > > >>>>> system,
> > > > > > > > > > >>>>>>> but we thought adding that to a system already
> > > > ingrained
> > > > > > with
> > > > > > > > > many
> > > > > > > > > > >>>>> problems
> > > > > > > > > > >>>>>>> would be wrong and hard to do gradually, as so
> many
> > > > > things
> > > > > > > will
> > > > > > > > > > >>>>> break. This
> > > > > > > > > > >>>>>>> is why the second-biggest design decision
> presented
> > > > here
> > > > > is
> > > > > > > > > > >>>>> consolidating
> > > > > > > > > > >>>>>>> all existing metric libraries into a single one -
> > > > > > > OpenTelemetry
> > > > > > > > > > >>>>>>> <https://opentelemetry.io/>. The parent PIP will
> > > > explain
> > > > > > why
> > > > > > > > > > >>>>>>> OpenTelemetry was chosen out of existing
> solutions
> > > and
> > > > > why
> > > > > > it
> > > > > > > > far
> > > > > > > > > > >>>>> exceeds
> > > > > > > > > > >>>>>>> all other options. I’ve been working closely
> with the
> > > > > > > > > OpenTelemetry
> > > > > > > > > > >>>>>>> community in the past eight months:
> brain-storming
> > > this
> > > > > > > > > integration,
> > > > > > > > > > >>>>> and
> > > > > > > > > > >>>>>>> raising issues, in an effort to remove serious
> > > blockers
> > > > > to
> > > > > > > make
> > > > > > > > > this
> > > > > > > > > > >>>>>>> migration successful.
> > > > > > > > > > >>>>>>>
> > > > > > > > > > >>>>>>> I made every effort to summarize this document so
> > > that
> > > > it
> > > > > > can
> > > > > > > > be
> > > > > > > > > > >>>>> concise
> > > > > > > > > > >>>>>>> yet clear. I understand it is an effort to read
> it
> > > and,
> > > > > > more
> > > > > > > > so,
> > > > > > > > > > >>>>> provide
> > > > > > > > > > >>>>>>> meaningful feedback on such a large document;
> hence
> > > I’m
> > > > > > very
> > > > > > > > > grateful
> > > > > > > > > > >>>>> for
> > > > > > > > > > >>>>>>> each individual who does so.
> > > > > > > > > > >>>>>>>
> > > > > > > > > > >>>>>>> I think this design will help improve the user
> > > > experience
> > > > > > > > > immensely,
> > > > > > > > > > >>>>> so it
> > > > > > > > > > >>>>>>> is worth the time spent reading it.
> > > > > > > > > > >>>>>>>
> > > > > > > > > > >>>>>>>
> > > > > > > > > > >>>>>>> === QUOTE END ===
> > > > > > > > > > >>>>>>>
> > > > > > > > > > >>>>>>>
> > > > > > > > > > >>>>>>> Thanks!
> > > > > > > > > > >>>>>>>
> > > > > > > > > > >>>>>>> Asaf Mesika
> > > > > > > > > > >>>>>>>
> > > > > > > > > > >>>>>
> > > > > > > > > > >>>>
> > > > > > > > > > >>
> > > > > > > > > > >>
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
>

Reply via email to