Hi -

> On Sep 24, 2024, at 11:01 AM, Heesung Sohn <hees...@apache.org> wrote:
> 
> Hi all,
> 
> I realized that sampling outlier topic/subscription/consumer/producer
> metrics can be useful to operate Pulsar.
> 
> I noticed that Pulsar's performance issues are mostly caused by
> outlier topics/subscriptions/consumers/producers.
> 
> For example, disk usage can grow a lot when consumers are very slow (0
> ack rate and 0 permits) while producers publish messages at a high rate
> to the same topic.

This is something you could observe while running an OpenMessaging Benchmark 
that creates and then drains a backlog.

> 
> I think sampling such outlier metrics can be useful to monitor Pulsar.
> Do we think this is a separate issue and deserves another PIP or
> should it be part of this PIP-264 work?

Even if it were not part of PIP-264 what about “outlier” metrics would require 
a new PIP?

Perhaps a better way of understanding such outliers is to take flame graphs of 
the brokers and bookkeepers you are testing with the OMB?

Best,
Dave

> 
> Thanks,
> Heesung
> 
> On Mon, Aug 28, 2023 at 7:48 AM Asaf Mesika <asaf.mes...@gmail.com> wrote:
>> 
>> I've relocated the PIP content from the issue (
>> https://github.com/apache/pulsar/issues/20197) to a PR (
>> https://github.com/apache/pulsar/pull/21080) so I could add a TOC and
>> also be aligned with the new process.
>> 
>> 
>> 
>> On Mon, Aug 28, 2023 at 5:46 PM Asaf Mesika <asaf.mes...@gmail.com> wrote:
>> 
>>> Thanks for taking the time to review the document - *highly appreciated*.
>>> I've inlined my comments below.
>>> 
>>> 
>>> On Mon, Aug 21, 2023 at 12:19 PM Hang Chen <chenh...@apache.org> wrote:
>>> 
>>>> Hi Asaf,
>>>>    Thanks for bringing up this great proposal.
>>>> 
>>>> After reading this proposal, I have the following questions.
>>>> 1. This proposal will introduce a big breaking change in Pulsar,
>>>> especially from a code perspective. I'm interested in how we can support
>>>> both the old and new implementations at the same time, step by step.
>>>> 
>>>>> We will keep the current metric system as is, and add a new layer of
>>>> metrics using the OpenTelemetry Java SDK. All of Pulsar's metrics will
>>>> also be created using OpenTelemetry. A feature flag will allow enabling
>>>> OpenTelemetry metrics (init, recording and exporting). All the features
>>>> and changes described here will be done only in the OpenTelemetry layer,
>>>> allowing you to keep the old version working until you're ready to switch
>>>> to the OTel (OpenTelemetry) implementation. In the far future, once OTel
>>>> usage has stabilized and become widely adopted, we'll deprecate the
>>>> current metric system and eventually remove it. We will also make sure
>>>> there is a feature flag to turn off the current Prometheus-based metric
>>>> system.
>>>> 
>>>> 
>>> Current metrics code remains as is, untouched.
>>> I'm adding new code using the OpenTelemetry API and SDK. In most cases
>>> the code will read the existing variables (like msgsReceived), and in
>>> other cases it will set up its own new objects like Counter and Histogram
>>> and *also* record values to them.
>>> You can take a look at the revised PIP, as I've added a tiny code sample
>>> to give an idea of how it will look. See here
>>> <https://github.com/apache/pulsar/blob/6ec0bde4127a54ab8e8bb67fb091c932fa2952a4/pip/pip-264.md#consolidating-to-opentelemetry>
>>> .
>>> 
>>> 
>>> 
>>>> 2. We introduced group and filter logic in the metric system, and I
>>>> have the following concerns.
>>>> - We need to add protection logic or pre-validation for the group and
>>>> filter rules, to avoid user misconfiguration causing a huge performance
>>>> impact on Pulsar brokers.
>>>> 
>>>> 
>>> Good call. I've added a note in the PIP that we will reject any filter
>>> rule update if the expected number of data points exceeds a certain
>>> threshold. I left this as a detail to be specified in the sub-PIP.
>>> 
>>>> - We need to support exposing all the topic-level metrics when the
>>>> Pulsar cluster has thousands of topics.
>>>> 
>>> I've added a new goal: "New system should support the maximum supported
>>> number of topics in the current system (i.e. 4k topics) without
>>> filtering."
>>> 
>>> 
>>>> - Even though we introduced group and filter for the metrics, we still
>>>> can't resolve the large number of metrics exposed to Prometheus.
>>>> Exposing a large amount of data (100MB+) through an HTTP endpoint is
>>>> ineffective. We can consider exposing the metric data via a Pulsar topic
>>>> and developing a Pulsar-to-Prometheus connector to write Pulsar metric
>>>> data to Prometheus in streaming mode instead of batch mode, to reduce
>>>> the performance impact.
>>>> 
>>> 
>>> As I wrote in my PIP: if you find yourself exporting 100MB or more of
>>> metric data *every* 30 seconds, you will suffer from:
>>> * High cost of the TSDB holding it (e.g. Prometheus, Cortex,
>>> VictoriaMetrics)
>>> * Query timeouts, since there is too much data to read
>>> 
>>> Also, the bottleneck is not transfer time over the wire. It's mostly the
>>> memory needed by any TSDB to hold it for at least 2 hours before flushing
>>> it to disk - this is the most expensive part of all.
>>> 
>>> At a 100MB response size, filtering and grouping are a must.
>>> 
>>> 
>>> 
>>>> 
>>>> - Group and filter logic uses regular expressions extensively in
>>>> rules. Regular-expression parsing and matching are CPU- and
>>>> time-intensive operations. We have a push-down filter to reduce the
>>>> number of generated metrics, but that still can't solve the
>>>> regular-expression matching issue. If the user provides a complex
>>>> regular expression for a group and filter rule, the metric-generating
>>>> thread will become the bottleneck and will block other threads if we use
>>>> synchronous calls.
>>>> 
>>>> 
>>> I plan to use caching, as written in the PIP - roughly (instrument,
>>> attributes) -> boolean. It's basically as if we were adding one boolean
>>> to the PersistentTopic class - it already has so many properties that the
>>> added size is negligible.
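For illustration, the (instrument, attributes) -> boolean cache described above might be sketched like this: the rule's regex runs once per unique key, after which the hot path pays only a hash lookup. Class and method names are illustrative, not the PIP's actual API.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.regex.Pattern;

// Sketch: cache the filter decision per unique (instrument, attributes)
// key so a complex user regex never runs on the metric-recording hot path.
final class FilterDecisionCache {
    private final Pattern rule;
    private final Map<String, Boolean> decisions = new ConcurrentHashMap<>();

    FilterDecisionCache(String ruleRegex) {
        this.rule = Pattern.compile(ruleRegex);
    }

    boolean shouldKeep(String instrumentAndAttributes) {
        // computeIfAbsent evaluates the regex only on a cache miss.
        return decisions.computeIfAbsent(
                instrumentAndAttributes,
                key -> rule.matcher(key).matches());
    }
}
```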
>>> 
>>> 
>>>> - The group and filter rules are a little complex for users, and we
>>>> need to provide a UI or tool to help users write correct and effective
>>>> rules and show the impact of new rules on old rules.
>>>> 
>>> 
>>> We don't have a UI in Pulsar for now. We will make sure the Pulsar CLI
>>> is convenient enough.
>>> 
>>> 
>>>> 
>>>> 
>>>> Thanks,
>>>> Hang
>>>> 
>>>> Matteo Merli <matteo.me...@gmail.com> wrote on Thu, Jun 15, 2023 at 23:14:
>>>>> 
>>>>>> Proposing a large breaking change (even if it's crucial) is the single
>>>>>> fastest way to motivate your users to migrate to a different platform.
>>>>>> I wish it wasn't the case, but it's the cold reality.
>>>>> 
>>>>> If you read the proposal, there is no real breaking change. There will
>>>>> be a switch to choose between the existing metrics and the new ones.
>>>>> The dashboards will be updated and provided.
>>>>> 
>>>>> At the same time, the surest way to motivate users to leave or never
>>>>> adopt a platform is to stick with confusing/inconsistent APIs/metrics.
>>>>> 
>>>>> 
>>>>> --
>>>>> Matteo Merli
>>>>> <matteo.me...@gmail.com>
>>>>> 
>>>>> 
>>>>> On Wed, Jun 14, 2023 at 6:10 PM Devin Bost <devin.b...@gmail.com>
>>>> wrote:
>>>>> 
>>>>>>> Thanks for the details, Devin. Curious - 'We' stands for which
>>>>>>> company?
>>>>>> 
>>>>>> What do you mean? I was quoting Rajan when I said, "we."
>>>>>> 
>>>>>> 
>>>>>> Devin G. Bost
>>>>>> 
>>>>>> 
>>>>>> On Wed, Jun 14, 2023 at 10:02 AM Asaf Mesika <asaf.mes...@gmail.com>
>>>>>> wrote:
>>>>>> 
>>>>>>> Thanks for the details, Devin. Curious - 'We' stands for which
>>>>>>> company?
>>>>>>> 
>>>>>>> Can you take a look at my previous response to see if it answers the
>>>>>>> concern you raised?
>>>>>>> 
>>>>>>> Thanks!
>>>>>>> 
>>>>>>> 
>>>>>>> On Wed, Jun 14, 2023 at 1:49 PM Devin Bost <devin.b...@gmail.com>
>>>> wrote:
>>>>>>> 
>>>>>>>>> Hi,
>>>>>>>>> 
>>>>>>>>> Are we proposing a change to break existing metrics compatibility
>>>>>>>>> (Prometheus)? If that is the case, then it's a big red flag, as it
>>>>>>>>> will be a pain for any company to upgrade Pulsar: monitoring is THE
>>>>>>>>> most important part of the system, and we don't want to break
>>>>>>>>> compatibility even for small things, to avoid interruption for
>>>>>>>>> users of the Pulsar system.
>>>>>>>>> I think it's always good to enhance a system while maintaining
>>>>>>>>> compatibility, and I would be fine if we can introduce a new
>>>>>>>>> metrics API without causing ANY interruption to the existing
>>>>>>>>> metrics API. But if we can't maintain compatibility, then it's a
>>>>>>>>> big red flag and not acceptable for the Pulsar community.
>>>>>>>> 
>>>>>>>> Proposing a large breaking change (even if it's crucial) is the
>>>> single
>>>>>>>> fastest way to motivate your users to migrate to a different
>>>> platform.
>>>>>> I
>>>>>>>> wish it wasn't the case, but it's the cold reality.
>>>>>>>> 
>>>>>>>> With that said, I'm a big proponent of OpenTelemetry. I did a big
>>>>>>>> video a while back that some of you may remember on the use of
>>>>>>>> OpenTracing (before it was merged into OpenTelemetry). OpenTelemetry
>>>>>>>> has gained considerable momentum in the industry since then.
>>>>>>>> 
>>>>>>>> I'm also very interested in a solution to the metrics problem. I've
>>>>>>>> run into the scalability issues with metrics in production, and I've
>>>>>>>> been very concerned about the metrics bottlenecks around our ability
>>>>>>>> to deliver on our promises of supporting large numbers of topics.
>>>>>>>> One of the big advantages of Pulsar over Kafka is supposed to be
>>>>>>>> that topics are cheap, but as it stands, our current metrics design
>>>>>>>> gets seriously in the way of that. Generally speaking, I'm open to
>>>>>>>> solutions, especially if they align us with a growing industry
>>>>>>>> standard.
>>>>>>>> 
>>>>>>>> - Devin
>>>>>>>> 
>>>>>>>> 
>>>>>>>> On Wed, Jun 14, 2023, 3:28 AM Enrico Olivelli <
>>>> eolive...@gmail.com>
>>>>>>> wrote:
>>>>>>>> 
>>>>>>>>> On Wed, Jun 14, 2023 at 04:33 Rajan Dhabalia <rdhaba...@apache.org>
>>>>>>>>> wrote:
>>>>>>>>> 
>>>>>>>>>> Hi,
>>>>>>>>>> 
>>>>>>>>>> Are we proposing a change to break existing metrics compatibility
>>>>>>>>>> (Prometheus)? If that is the case, then it's a big red flag, as it
>>>>>>>>>> will be a pain for any company to upgrade Pulsar: monitoring is
>>>>>>>>>> THE most important part of the system, and we don't want to break
>>>>>>>>>> compatibility even for small things, to avoid interruption for
>>>>>>>>>> users of the Pulsar system.
>>>>>>>>>> I think it's always good to enhance a system while maintaining
>>>>>>>>>> compatibility, and I would be fine if we can introduce a new
>>>>>>>>>> metrics API without causing ANY interruption to the existing
>>>>>>>>>> metrics API. But if we can't maintain compatibility, then it's a
>>>>>>>>>> big red flag and not acceptable for the Pulsar community.
>>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> I agree.
>>>>>>>>> 
>>>>>>>>> If it is possible to export data in a way that is compatible with
>>>>>>>>> Prometheus without adding too much overhead, then I would support
>>>>>>>>> this work.
>>>>>>>>> 
>>>>>>>>> About renaming the metrics: we can do it only if the changes for
>>>>>>>>> users are as trivial as replacing the queries in the Grafana
>>>>>>>>> dashboards or in alerting systems.
>>>>>>>>> 
>>>>>>>>> Asaf, do you have a prototype? Built on any version of Pulsar?
>>>>>>>>> 
>>>>>>>>> Also, it would be very useful to start an initiative to collect
>>>>>>>>> the list of metrics that people really use in production,
>>>>>>>>> especially for automated alerts.
>>>>>>>>> 
>>>>>>>>> In my experience you usually care about:
>>>>>>>>> - in/out traffic (rates, bytes...)
>>>>>>>>> - number of producers, consumers, topics, subscriptions...
>>>>>>>>> - backlog
>>>>>>>>> - jvm metrics
>>>>>>>>> - function custom metrics
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> Enrico
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>>> Thanks,
>>>>>>>>>> Rajan
>>>>>>>>>> 
>>>>>>>>>> On Sun, May 21, 2023 at 9:01 AM Asaf Mesika <
>>>> asaf.mes...@gmail.com
>>>>>>> 
>>>>>>>>> wrote:
>>>>>>>>>> 
>>>>>>>>>>> Thanks for the reply, Enrico.
>>>>>>>>>>> Completely agree.
>>>>>>>>>>> This made me realize my TL;DR wasn't talking about export.
>>>>>>>>>>> I added this to it:
>>>>>>>>>>> 
>>>>>>>>>>> ---
>>>>>>>>>>> Pulsar OTel Metrics will support exporting as Prometheus
>>>> HTTP
>>>>>>>> endpoint
>>>>>>>>>>> (`/metrics` but different port) for backward compatibility
>>>> and
>>>>>> also
>>>>>>>>> OTLP,
>>>>>>>>>>> so you can push the metrics to OTel Collector and from
>>>> there ship
>>>>>>> it
>>>>>>>> to
>>>>>>>>>> any
>>>>>>>>>>> destination.
>>>>>>>>>>> ---
>>>>>>>>>>> 
>>>>>>>>>>> OTel supports two kinds of exporters: Prometheus (HTTP) and OTLP
>>>>>>>>>>> (push). We'll just configure them to be used.
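For reference, if the OTel Java SDK's autoconfiguration module is what ends up wiring this, enabling the two exporters could look roughly like the following. Property names are from the OTel Java autoconfigure conventions; the port and endpoint values are placeholders, and whether Pulsar exposes them this way is not decided by this thread.

```properties
# Prometheus pull exporter: serves a /metrics endpoint on its own port
otel.metrics.exporter=prometheus
otel.exporter.prometheus.port=9464

# Or the OTLP push exporter, shipping metrics to an OTel Collector
# otel.metrics.exporter=otlp
# otel.exporter.otlp.endpoint=http://localhost:4317
```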
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> On Mon, May 15, 2023 at 10:35 AM Enrico Olivelli <
>>>>>>>> eolive...@gmail.com>
>>>>>>>>>>> wrote:
>>>>>>>>>>> 
>>>>>>>>>>>> Asaf,
>>>>>>>>>>>> thanks for contributing in this area.
>>>>>>>>>>>> Metrics are a fundamental feature of Pulsar.
>>>>>>>>>>>> 
>>>>>>>>>>>> Currently I find it very awkward to maintain metrics, and
>>>> also
>>>>>> I
>>>>>>>> see
>>>>>>>>>>>> it as a problem to support only Prometheus.
>>>>>>>>>>>> 
>>>>>>>>>>>> Regarding your proposal, IIRC in the past someone else proposed
>>>>>>>>>>>> to support other metrics systems, and they were advised to use a
>>>>>>>>>>>> sidecar approach, that is, to add something next to the Pulsar
>>>>>>>>>>>> services that serves the metrics in the preferred format/way.
>>>>>>>>>>>> I find the sidecar approach too inefficient and I am not
>>>>>>>>>>>> proposing it (but I wanted to add this reference for the benefit
>>>>>>>>>>>> of new people on the list).
>>>>>>>>>>>> 
>>>>>>>>>>>> I wonder if it would be possible to keep compatibility with the
>>>>>>>>>>>> current Prometheus-based metrics.
>>>>>>>>>>>> Pulsar has now reached a point at which it is widely used by
>>>>>>>>>>>> many companies, including with big clusters. Telling people that
>>>>>>>>>>>> they have to rework all the infrastructure related to metrics
>>>>>>>>>>>> because we don't support Prometheus anymore, or because we
>>>>>>>>>>>> radically changed the way we publish metrics, is a step that
>>>>>>>>>>>> seems too hard from my point of view.
>>>>>>>>>>>> 
>>>>>>>>>>>> Currently I believe that compatibility is more important
>>>> than
>>>>>>>>>>>> versatility, and if we want to introduce new (and far
>>>> better)
>>>>>>>>> features
>>>>>>>>>>>> we must take it into account.
>>>>>>>>>>>> 
>>>>>>>>>>>> So my point is that I generally support the idea of opening the
>>>>>>>>>>>> way to OpenTelemetry, but we must have a way to not force all of
>>>>>>>>>>>> our users to throw away their alerting systems, dashboards and
>>>>>>>>>>>> know-how in troubleshooting Pulsar problems in production and
>>>>>>>>>>>> dev.
>>>>>>>>>>>> Best regards
>>>>>>>>>>>> Enrico
>>>>>>>>>>>> 
>>>>>>>>>>>> On Mon, May 15, 2023 at 02:17 Dave Fisher
>>>>>>>>>>>> <wave4d...@comcast.net> wrote:
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>>> On May 10, 2023, at 1:01 AM, Asaf Mesika <
>>>>>>>> asaf.mes...@gmail.com>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> On Tue, May 9, 2023 at 11:29 PM Dave Fisher <
>>>>>>> w...@apache.org>
>>>>>>>>>>> wrote:
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> On May 8, 2023, at 2:49 AM, Asaf Mesika <
>>>>>>>>> asaf.mes...@gmail.com>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Your feedback made me realize I needed to add a "TL;DR"
>>>>>>>>>>>>>>>> section, which I just added.
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> I'm quoting it here. It gives a brief summary of the
>>>>>>>>>>>>>>>> proposal, requiring up to 5 minutes of reading time, helping
>>>>>>>>>>>>>>>> you get a high-level picture before you dive into the
>>>>>>>>>>>>>>>> background/motivation/solution.
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> ----------------------
>>>>>>>>>>>>>>>> TL;DR
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Working with metrics today, as a user or a developer, is
>>>>>>>>>>>>>>>> hard and has many severe issues.
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> From the user perspective:
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> - One of Pulsar's strongest features is "cheap" topics, so
>>>>>>>>>>>>>>>> you can easily have 10k - 100k topics per broker. Once you
>>>>>>>>>>>>>>>> do that, you quickly learn that the amount of metrics you
>>>>>>>>>>>>>>>> export via "/metrics" (the Prometheus-style endpoint)
>>>>>>>>>>>>>>>> becomes really big. The cost to store them becomes too high,
>>>>>>>>>>>>>>>> queries time out, or even the "/metrics" endpoint itself
>>>>>>>>>>>>>>>> times out.
>>>>>>>>>>>>>>>> The only option Pulsar gives you today is all-or-nothing
>>>>>>>>>>>>>>>> filtering and very crude aggregation. You switch metrics
>>>>>>>>>>>>>>>> from topic aggregation level to namespace aggregation level.
>>>>>>>>>>>>>>>> You can also turn off producer- and consumer-level metrics.
>>>>>>>>>>>>>>>> You end up doing it all, leaving you "blind", looking at the
>>>>>>>>>>>>>>>> metrics from a namespace level, which is too high-level. You
>>>>>>>>>>>>>>>> end up conjuring all kinds of scripts on top of the topic
>>>>>>>>>>>>>>>> stats endpoint to glue together some aggregated metrics view
>>>>>>>>>>>>>>>> for the topics you need.
>>>>>>>>>>>>>>>> - Summaries (a metric type giving you quantiles like p95),
>>>>>>>>>>>>>>>> which are used in Pulsar, can't be aggregated across
>>>>>>>>>>>>>>>> topics/brokers due to their inherent design.
>>>>>>>>>>>>>>>> - Plugin authors spend too much time on defining and
>>>>>>>>>>>>>>>> exposing metrics to Pulsar, since the only interface Pulsar
>>>>>>>>>>>>>>>> offers is writing your metrics yourself as UTF-8 bytes in
>>>>>>>>>>>>>>>> Prometheus text format to a byte-stream interface given to
>>>>>>>>>>>>>>>> you.
>>>>>>>>>>>>>>>> - Pulsar histograms are exported in a way that is not
>>>>>>>>>>>>>>>> conformant with Prometheus, which means you can't get the
>>>>>>>>>>>>>>>> p95 quantile on such histograms, making them very hard to
>>>>>>>>>>>>>>>> use in day-to-day life.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> What version of DataSketches is used to produce the
>>>>>>>>>>>>>>> histogram? Is it still an old Yahoo one, or are we using an
>>>>>>>>>>>>>>> updated one from Apache DataSketches?
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Seems like this is a single PR/small PIP for 3.1?
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Histograms are a list of buckets; each bucket is a counter.
>>>>>>>>>>>>>> A summary is a collection of values collected over a time
>>>>>>>>>>>>>> window, from which at the end you get a calculation of the
>>>>>>>>>>>>>> quantiles of those values (p95, p50), and those are what
>>>>>>>>>>>>>> Pulsar exports.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Pulsar histograms do not use DataSketches.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Bookkeeper Metrics wraps Yahoo DataSketches last I
>>>> checked.
>>>>>>>>>>>>> 
>>>>>>>>>>>>>> They are just counters.
>>>>>>>>>>>>>> They do not adhere to Prometheus conventions since:
>>>>>>>>>>>>>> a. The counter is expected to be cumulative, but Pulsar resets
>>>>>>>>>>>>>> each bucket counter to 0 every 1 minute.
>>>>>>>>>>>>>> b. The bucket's upper bound is expected to be written as a
>>>>>>>>>>>>>> label "le", but today it is encoded in the name of the metric
>>>>>>>>>>>>>> itself.
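For illustration, a conformant Prometheus histogram exposes ever-growing cumulative bucket counters with the upper bound as an `le` label, roughly like this (the metric name and numbers are made up):

```text
pulsar_storage_write_latency_bucket{le="1.0"} 4
pulsar_storage_write_latency_bucket{le="5.0"} 9
pulsar_storage_write_latency_bucket{le="+Inf"} 10
pulsar_storage_write_latency_sum 23.1
pulsar_storage_write_latency_count 10
```

Encoding the bound in the metric name instead, and resetting the counts every minute, is what prevents PromQL's histogram_quantile() from working on the current output.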
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> This is a breaking change, hence hard to make in any small
>>>>>>>>>>>>>> release. This is why it's part of this PIP: so many things
>>>>>>>>>>>>>> will break, and all of them will break on a separate layer
>>>>>>>>>>>>>> (OTel metrics), hence not breaking anyone without their
>>>>>>>>>>>>>> consent.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> If this change breaks existing Grafana dashboards and other
>>>>>>>>>>>>> operational monitoring already in place, then it will break
>>>>>>>>>>>>> guarantees we have made about safely being able to downgrade
>>>>>>>>>>>>> from a bad upgrade.
>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> - Too many metrics are rates, which also delta-reset every
>>>>>>>>>>>>>>>> interval you configure in Pulsar and on restart, instead of
>>>>>>>>>>>>>>>> relying on cumulative (ever-growing) counters and letting
>>>>>>>>>>>>>>>> Prometheus use its rate function.
>>>>>>>>>>>>>>>> - and many more issues
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> From the developer perspective:
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> - There are 4 different ways to define and record metrics
>>>>>>>>>>>>>>>> in Pulsar: Pulsar's own metrics library, the Prometheus Java
>>>>>>>>>>>>>>>> client, the BookKeeper metrics library and plain native Java
>>>>>>>>>>>>>>>> SDK objects (AtomicLong, ...). It's very confusing for the
>>>>>>>>>>>>>>>> developer and creates inconsistencies for the end user (e.g.
>>>>>>>>>>>>>>>> Summary is different in each).
>>>>>>>>>>>>>>>> - Patching your metrics into the "/metrics" Prometheus
>>>>>>>>>>>>>>>> endpoint is confusing, cumbersome and error-prone.
>>>>>>>>>>>>>>>> - many more
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> This proposal offers several key changes to solve
>>>> that:
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> - Cardinality (supporting 10k-100k topics per broker) is
>>>>>>>>>>>>>>>> solved by introducing a new aggregation level for metrics
>>>>>>>>>>>>>>>> called Topic Metric Group. Using configuration, you specify
>>>>>>>>>>>>>>>> for each topic its group (using wildcard/regex). This allows
>>>>>>>>>>>>>>>> you to "zoom out" to a granularity level like groups instead
>>>>>>>>>>>>>>>> of namespaces; since you control how many groups you'll
>>>>>>>>>>>>>>>> have, this solves the cardinality issue without sacrificing
>>>>>>>>>>>>>>>> the level of detail too much.
>>>>>>>>>>>>>>>> - A fine-grained, dynamic filtering mechanism. You'll have
>>>>>>>>>>>>>>>> rule-based dynamic configuration, allowing you to specify
>>>>>>>>>>>>>>>> per namespace/topic/group which metrics you'd like to
>>>>>>>>>>>>>>>> keep/drop. Rules allow you to set the default to a small
>>>>>>>>>>>>>>>> amount of metrics at group and namespace level only, and
>>>>>>>>>>>>>>>> drop the rest. When needed, you can add an override rule to
>>>>>>>>>>>>>>>> "open" up a certain group to have more metrics at higher
>>>>>>>>>>>>>>>> granularity (topic or even consumer/producer level). Since
>>>>>>>>>>>>>>>> it's dynamic, you "open" such a group when you see it
>>>>>>>>>>>>>>>> misbehaving, inspect it at topic level, and when all is
>>>>>>>>>>>>>>>> resolved, you can "close" it. It's a bit similar to logging
>>>>>>>>>>>>>>>> levels in Log4j or Logback, where you set a default and
>>>>>>>>>>>>>>>> override per class/package.
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Aggregation and filtering combined solve the cardinality
>>>>>>>>>>>>>>>> problem without sacrificing the level of detail when needed,
>>>>>>>>>>>>>>>> and most importantly, you determine which
>>>>>>>>>>>>>>>> topic/group/namespace it happens on.
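A toy sketch of the group-mapping idea, assuming an ordered, first-match-wins rule list. The rule syntax, the names, and the "default" fallback behavior here are assumptions for illustration, not the PIP's actual configuration format.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical sketch: ordered (regex -> group) rules map each topic to a
// Topic Metric Group, and metrics are then aggregated per group instead of
// per topic, bounding cardinality by the number of groups.
final class TopicMetricGroups {
    private final Map<Pattern, String> rules = new LinkedHashMap<>();

    void addRule(String topicRegex, String group) {
        rules.put(Pattern.compile(topicRegex), group);
    }

    /** First matching rule wins; topics with no rule fall into "default". */
    String groupOf(String topic) {
        return rules.entrySet().stream()
                .filter(e -> e.getKey().matcher(topic).matches())
                .map(Map.Entry::getValue)
                .findFirst()
                .orElse("default");
    }
}
```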
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Since this change is so invasive, it requires a single
>>>>>>>>>>>>>>>> metrics library to implement all of it on top of; hence the
>>>>>>>>>>>>>>>> third big change is consolidating all four ways to define
>>>>>>>>>>>>>>>> and record metrics into a single new one: OpenTelemetry
>>>>>>>>>>>>>>>> Metrics (the Java SDK, and also Python and Go for the Pulsar
>>>>>>>>>>>>>>>> Function runners).
>>>>>>>>>>>>>>>> Introducing OpenTelemetry (OTel) also solves the biggest
>>>>>>>>>>>>>>>> pain point from the developer perspective, since it's a
>>>>>>>>>>>>>>>> superb metrics library offering everything you need, and
>>>>>>>>>>>>>>>> there is going to be a single way - only it. It also solves
>>>>>>>>>>>>>>>> the robustness issue for plugin authors, who will use
>>>>>>>>>>>>>>>> OpenTelemetry. It so happens that it also solves all the
>>>>>>>>>>>>>>>> numerous problems described in the doc itself.
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> The solution will be introduced as another layer with
>>>>>>>>>>>>>>>> feature toggles, so you can work with the existing system
>>>>>>>>>>>>>>>> and/or OTel, until the existing system is gradually
>>>>>>>>>>>>>>>> deprecated.
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> It's a big breaking change for Pulsar users on many fronts:
>>>>>>>>>>>>>>>> names, semantics, configuration. Read the end of this doc to
>>>>>>>>>>>>>>>> learn exactly what will change for the user (at a high
>>>>>>>>>>>>>>>> level).
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> In my opinion, it will make the Pulsar user experience so
>>>>>>>>>>>>>>>> much better that users will want to migrate to it, despite
>>>>>>>>>>>>>>>> the breaking change.
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> This was a very short summary. You are most welcome to read
>>>>>>>>>>>>>>>> the full design document below and give feedback, so we can
>>>>>>>>>>>>>>>> make it better.
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> On Sun, May 7, 2023 at 7:52 PM Asaf Mesika <
>>>>>>>>>> asaf.mes...@gmail.com>
>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> On Sun, May 7, 2023 at 4:23 PM Yunze Xu
>>>>>>>>>>>> <y...@streamnative.io.invalid>
>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> I was excited to learn much more about metrics when I
>>>>>>>>>>>>>>>>>> started reading this proposal. But I became more and more
>>>>>>>>>>>>>>>>>> frustrated when I found there was still too much content
>>>>>>>>>>>>>>>>>> left even after I had already spent much time reading it.
>>>>>>>>>>>>>>>>>> I'm wondering how much time you expected reviewers to
>>>>>>>>>>>>>>>>>> spend reading through this proposal. I just recalled the
>>>>>>>>>>>>>>>>>> discussion you started before [1]. Did you expect each PMC
>>>>>>>>>>>>>>>>>> member who gives his/her +1 to read only parts of this
>>>>>>>>>>>>>>>>>> proposal?
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> I estimated around 2 hours needed for a reviewer.
>>>>>>>>>>>>>>>>> I hate it being so long, but I simply couldn't find a way
>>>>>>>>>>>>>>>>> to downsize it more. Furthermore, I consulted with my
>>>>>>>>>>>>>>>>> colleagues, including Matteo, but we couldn't see a way to
>>>>>>>>>>>>>>>>> scope it down.
>>>>>>>>>>>>>>>>> Why? Because once you begin this journey, you need to know
>>>>>>>>>>>>>>>>> how it's going to end.
>>>>>>>>>>>>>>>>> What I ended up doing is writing all the crucial details
>>>>>>>>>>>>>>>>> for review in the High Level Design section.
>>>>>>>>>>>>>>>>> It's still a big, hefty section, but I don't think I can
>>>>>>>>>>>>>>>>> step out, or let anyone else change Pulsar so invasively,
>>>>>>>>>>>>>>>>> without the full extent of the change.
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> I don't think it's wise to read only parts.
>>>>>>>>>>>>>>>>> I did my very best to minimize it, but the scope is simply
>>>>>>>>>>>>>>>>> big.
>>>>>>>>>>>>>>>>> I'm open to suggestions, but that requires reading the
>>>>>>>>>>>>>>>>> whole PIP :)
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> Thanks a lot, Yunze, for dedicating any time to it.
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> Let's get back to the proposal. For now, what I mainly
>>>>>>>>>>>>>>>>>> learned and am mostly concerned about are:
>>>>>>>>>>>>>>>>>> 1. Pulsar has many ways to expose metrics. It's not
>>>>>>>>>>>>>>>>>> unified, and it's confusing.
>>>>>>>>>>>>>>>>>> 2. The current metrics system cannot support a large
>>>>>>>>>>>>>>>>>> number of topics.
>>>>>>>>>>>>>>>>>> 3. It's hard for plugin authors to integrate metrics. (For
>>>>>>>>>>>>>>>>>> example, KoP [2] integrates metrics by implementing the
>>>>>>>>>>>>>>>>>> PrometheusRawMetricsProvider interface, and it indeed
>>>>>>>>>>>>>>>>>> needs much work.)
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> Regarding the 1st issue, this proposal chooses
>>>>>>>>>>>>>>>>>> OpenTelemetry (OTel).
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> Regarding the 2nd issue, I scrolled to the "Why
>>>>>>>>>>>>>>>>>> OpenTelemetry?" section. It's still frustrating to see no
>>>>>>>>>>>>>>>>>> answer. Eventually, I found
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> OpenTelemetry isn't the solution for a large number of
>>>>>>>>>>>>>>>>> topics. The solution is described in the
>>>>>>>>>>>>>>>>> "Aggregate and Filtering to solve cardinality issues"
>>>>>>>>>>>>>>>>> section.
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> the explanation in the "What we need to fix in
>>>>>>>>>>>>>>>>>> OpenTelemetry - Performance" section. It seems that we
>>>>>>>>>>>>>>>>>> still need some enhancements in OTel. In other words, OTel
>>>>>>>>>>>>>>>>>> is currently not ready to resolve all the issues listed in
>>>>>>>>>>>>>>>>>> the proposal, but we believe it will be.
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> Let me rephrase "believe" --> we will work together with
>>>>>>>>>>>>>>>>> the maintainers to do it, yes.
>>>>>>>>>>>>>>>>> I am open to any other suggestion.
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> As for the 3rd issue, from the "Integrating with Pulsar Plugins" section, the plugin authors still need to implement the new OTel interfaces. Is it much easier than using the existing ways to expose metrics? Could metrics still be easily integrated with Grafana?
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> Yes, it's way easier.
>>>>>>>>>>>>>>>>> Basically you have full-fledged metrics library objects: Meter, Gauge, Histogram, Counter.
>>>>>>>>>>>>>>>>> No more Raw Metrics Provider writing UTF-8 bytes in Prometheus format.
>>>>>>>>>>>>>>>>> You get namespacing for free with the Meter name and version.
>>>>>>>>>>>>>>>>> It's way better than the current solution and any other library.
>>>>>>>>>>>>>>>>> 
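[Editor's note: to make the contrast above concrete, here is a small illustrative sketch in plain Java. It is NOT the actual Pulsar or OpenTelemetry API; `rawPrometheusLine` and the `Counter` class are hypothetical stand-ins contrasting hand-formatted Prometheus text with a typed metrics object.]

```java
// Illustrative sketch only -- hypothetical stand-ins, not real Pulsar/OTel APIs.
import java.util.concurrent.atomic.AtomicLong;

public class MetricsStyles {

    // Old style: the plugin author hand-formats Prometheus exposition text,
    // which is roughly what a raw metrics provider amounts to.
    public static String rawPrometheusLine(String name, long value) {
        return name + " " + value + "\n";
    }

    // New style: a typed counter object, similar in spirit to what a
    // full-fledged metrics API (Meter/Counter/Gauge/Histogram) provides;
    // the exporter, not the plugin, handles the wire format.
    public static final class Counter {
        private final AtomicLong value = new AtomicLong();
        public void add(long delta) { value.addAndGet(delta); }
        public long get() { return value.get(); }
    }

    public static void main(String[] args) {
        Counter requests = new Counter();
        requests.add(3);
        // With the old style, the plugin itself would emit this text line:
        System.out.print(rawPrometheusLine("kop_requests_total", requests.get()));
    }
}
```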
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> That's all I am concerned about at the moment. I understand, and appreciate, that you've spent much time studying and explaining all these things. But this proposal is still too huge.
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> I appreciate your effort a lot!
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> [1] https://lists.apache.org/thread/04jxqskcwwzdyfghkv4zstxxmzn154kf
>>>>>>>>>>>>>>>>>> [2] https://github.com/streamnative/kop/blob/master/kafka-impl/src/main/java/io/streamnative/pulsar/handlers/kop/stats/PrometheusMetricsProvider.java
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>> Yunze
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> On Sun, May 7, 2023 at 5:53 PM Asaf Mesika <asaf.mes...@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> I'm very appreciative of feedback from multiple Pulsar users and devs on this PIP, since it suggests dramatic changes and quite an extensive positive change for the users.
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> On Thu, Apr 27, 2023 at 7:32 PM Asaf Mesika <asaf.mes...@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> Hi all,
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> I'm very excited to release a PIP I've been working on in the past 11 months, which I think will be immensely valuable to Pulsar, which I like so much.
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> PIP: https://github.com/apache/pulsar/issues/20197
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> I'm quoting here the preface:
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> === QUOTE START ===
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> Roughly 11 months ago, I started working on solving the biggest issue with Pulsar metrics: the lack of ability to monitor a pulsar broker with a large topic count: 10k, 100k, and future support of 1M. This started by mapping the existing functionality and then enumerating all the problems I saw (all documented in this doc
>>>>>>>>>>>>>>>>>>>> <https://docs.google.com/document/d/1vke4w1nt7EEgOvEerPEUS-Al3aqLTm9cl2wTBkKNXUA/edit?usp=sharing>
>>>>>>>>>>>>> 
>>>>>>>>>>>>> I thought we were going to stop using Google docs for PIPs.
>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> ).
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> This PIP is a parent PIP. It aims to gradually solve (using sub-PIPs) all the current metric system's problems and provide the ability to monitor a broker with a large topic count, which is currently lacking. As a parent PIP, it will describe each problem and its solution at a high level, leaving fine-grained details to the sub-PIPs. The parent PIP ensures all solutions align and do not contradict each other.
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> The basic building block to solve the monitoring ability of a large topic count is aggregating internally (to topic groups) and adding fine-grained filtering. We could have shoe-horned it into the existing metric system, but we thought adding that to a system already ingrained with many problems would be wrong and hard to do gradually, as so many things would break. This is why the second-biggest design decision presented here is consolidating all existing metric libraries into a single one - OpenTelemetry <https://opentelemetry.io/>. The parent PIP will explain why OpenTelemetry was chosen out of existing solutions and why it far exceeds all other options. I've been working closely with the OpenTelemetry community in the past eight months: brain-storming this integration and raising issues, in an effort to remove serious blockers to make this migration successful.
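[Editor's note: the "aggregate internally to topic groups" idea above can be sketched as follows. This is illustrative plain Java, not the PIP's actual design; the grouping rule and all names are hypothetical.]

```java
// Illustrative sketch only -- hypothetical grouping rule, not the PIP's design.
import java.util.HashMap;
import java.util.Map;

public class TopicGroupAggregation {

    // Hypothetical grouping rule: a topic's "group" is everything up to the
    // last '/' (roughly its namespace), so thousands of per-topic series
    // collapse into a handful of group-level series.
    static String groupOf(String topic) {
        int idx = topic.lastIndexOf('/');
        return idx < 0 ? topic : topic.substring(0, idx);
    }

    // Aggregate per-topic counts into per-group totals, reducing metric
    // cardinality from O(topics) to O(groups).
    static Map<String, Long> aggregate(Map<String, Long> perTopic) {
        Map<String, Long> perGroup = new HashMap<>();
        perTopic.forEach((topic, count) ->
            perGroup.merge(groupOf(topic), count, Long::sum));
        return perGroup;
    }

    public static void main(String[] args) {
        Map<String, Long> perTopic = new HashMap<>();
        perTopic.put("tenant/ns-a/topic-1", 10L);
        perTopic.put("tenant/ns-a/topic-2", 5L);
        perTopic.put("tenant/ns-b/topic-1", 7L);
        // Three per-topic series become two per-group series.
        System.out.println(aggregate(perTopic));
    }
}
```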
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> I made every effort to summarize this document so that it can be concise yet clear. I understand it is an effort to read it and, more so, to provide meaningful feedback on such a large document; hence I'm very grateful for each individual who does so.
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> I think this design will help improve the user experience immensely, so it is worth the time spent reading it.
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> === QUOTE END ===
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> Thanks!
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> Asaf Mesika
>>>>>>>>>>>>>>>>>>>> 

