> Proposing a large breaking change (even if it's crucial) is the single fastest way to motivate your users to migrate to a different platform. I wish it wasn't the case, but it's the cold reality.

If you read the proposal, there is no real breaking change. There will be a switch to choose between the existing metrics and the new ones. The dashboards will be updated and provided.

At the same time, the surest way to motivate users to switch away from a platform, or not adopt it, is to stick with confusing/inconsistent APIs/metrics.

--
Matteo Merli
<matteo.me...@gmail.com>

On Wed, Jun 14, 2023 at 6:10 PM Devin Bost <devin.b...@gmail.com> wrote:

> > Thanks for the details, Devin. Curious - 'We' stands for which company?
>
> What do you mean? I was quoting Rajan when I said, "we."
>
> Devin G. Bost
>
> On Wed, Jun 14, 2023 at 10:02 AM Asaf Mesika <asaf.mes...@gmail.com> wrote:
>
> > Thanks for the details, Devin. Curious - 'We' stands for which company?
> >
> > Can you take a look at my previous response to see if it answers the concern you raised?
> >
> > Thanks!
> >
> > On Wed, Jun 14, 2023 at 1:49 PM Devin Bost <devin.b...@gmail.com> wrote:
> >
> > > > Hi,
> > > >
> > > > Are we proposing a change to break existing metrics compatibility (prometheus)? If that is the case then it's a big red flag as it will be a pain for any company to upgrade Pulsar as monitoring is THE most important part of the system and we don't even want to break compatibility for any small things to avoid interruption for users that are using Pulsar system. I think it's always good to enhance a system by maintaining compatibility and I would be fine if we can introduce new metrics API without causing ANY interruption to existing metrics API. But if we can't maintain compatibility then it's a big red flag and not acceptable for the Pulsar community.
> > >
> > > Proposing a large breaking change (even if it's crucial) is the single fastest way to motivate your users to migrate to a different platform. I wish it wasn't the case, but it's the cold reality.
> > >
> > > With that said, I'm a big proponent of OpenTelemetry. I did a big video a while back that some of you may remember on the use of OpenTracing (before it was merged into OpenTelemetry). OpenTelemetry has gained considerable momentum in the industry since then.
> > >
> > > I'm also very interested in a solution to the metrics problem. I've run into the scalability issues with metrics in production, and I've been very concerned about the metrics bottlenecks around our ability to deliver our promises around supporting large numbers of topics. One of the big advantages of Pulsar over Kafka is supposed to be that topics are cheap, but as it stands, our current metrics design gets seriously in the way of that. Generally speaking, I'm open to solutions, especially if they align us with a growing industry standard.
> > >
> > > - Devin
> > >
> > > On Wed, Jun 14, 2023, 3:28 AM Enrico Olivelli <eolive...@gmail.com> wrote:
> > >
> > > > On Wed, Jun 14, 2023, 04:33 Rajan Dhabalia <rdhaba...@apache.org> wrote:
> > > >
> > > > > Hi,
> > > > >
> > > > > Are we proposing a change to break existing metrics compatibility (prometheus)? If that is the case then it's a big red flag as it will be a pain for any company to upgrade Pulsar as monitoring is THE most important part of the system and we don't even want to break compatibility for any small things to avoid interruption for users that are using Pulsar system. I think it's always good to enhance a system by maintaining compatibility and I would be fine if we can introduce new metrics API without causing ANY interruption to existing metrics API. But if we can't maintain compatibility then it's a big red flag and not acceptable for the Pulsar community.
> > > > >
> > > >
> > > > I agree.
> > > >
> > > > If it is possible to export data in a way that is compatible with Prometheus without adding too much overhead then I would support this work.
> > > >
> > > > About renaming the metrics: we can do it only if the changes for users are as trivial as replacing the queries in the Grafana dashboard or in alerting systems.
> > > >
> > > > Asaf, do you have a prototype? Built over any version of Pulsar?
> > > >
> > > > Also, it would be very useful to start an initiative to collect the list of metrics that people really use in production, especially for automated alerts.
> > > >
> > > > In my experience you usually care about:
> > > > - in/out traffic (rates, bytes...)
> > > > - number of producers, consumers, topics, subscriptions...
> > > > - backlog
> > > > - JVM metrics
> > > > - function custom metrics
> > > >
> > > > Enrico
> > > >
> > > > > Thanks,
> > > > > Rajan
> > > > >
> > > > > On Sun, May 21, 2023 at 9:01 AM Asaf Mesika <asaf.mes...@gmail.com> wrote:
> > > > >
> > > > > > Thanks for the reply, Enrico.
> > > > > > Completely agree.
> > > > > > This made me realize my TL;DR wasn't talking about export.
> > > > > > I added this to it:
> > > > > >
> > > > > > ---
> > > > > > Pulsar OTel Metrics will support exporting as a Prometheus HTTP endpoint (`/metrics` but on a different port) for backward compatibility and also OTLP, so you can push the metrics to an OTel Collector and from there ship them to any destination.
> > > > > > ---
> > > > > >
> > > > > > OTel supports two kinds of exporter: Prometheus (HTTP) and OTLP (push).
> > > > > > We'll just configure it to use them.
> > > > > >
> > > > > > On Mon, May 15, 2023 at 10:35 AM Enrico Olivelli <eolive...@gmail.com> wrote:
> > > > > >
> > > > > > > Asaf,
> > > > > > > thanks for contributing in this area.
> > > > > > > Metrics are a fundamental feature of Pulsar.
> > > > > > >
> > > > > > > Currently I find it very awkward to maintain metrics, and also I see it as a problem to support only Prometheus.
> > > > > > >
> > > > > > > Regarding your proposal, IIRC in the past someone else proposed to support other metrics systems and they were advised to use a sidecar approach, that is, to add something next to Pulsar services that serves the metrics in the preferred format/way. I find that the sidecar approach is too inefficient and I am not proposing it (but I wanted to add this reference for the benefit of new people on the list).
> > > > > > >
> > > > > > > I wonder if it would be possible to keep compatibility with the current Prometheus-based metrics. Pulsar has now reached a point at which it is widely used by many companies, including with big clusters. Telling people that they have to rework all the infrastructure related to metrics because we don't support Prometheus anymore, or because we radically changed the way we publish metrics, is a step that seems too hard from my point of view.
> > > > > > >
> > > > > > > Currently I believe that compatibility is more important than versatility, and if we want to introduce new (and far better) features we must take it into account.
> > > > > > >
> > > > > > > So my point is that I generally support the idea of opening the way to OpenTelemetry, but we must have a way to not force all of our users to throw away their alerting systems, dashboards and know-how in troubleshooting Pulsar problems in production and dev.
> > > > > > >
> > > > > > > Best regards
> > > > > > > Enrico
> > > > > > >
> > > > > > > On Mon, May 15, 2023 at 02:17 Dave Fisher <wave4d...@comcast.net> wrote:
> > > > > > >
> > > > > > > > > On May 10, 2023, at 1:01 AM, Asaf Mesika <asaf.mes...@gmail.com> wrote:
> > > > > > > > >
> > > > > > > > > On Tue, May 9, 2023 at 11:29 PM Dave Fisher <w...@apache.org> wrote:
> > > > > > > > >
> > > > > > > > >>>> On May 8, 2023, at 2:49 AM, Asaf Mesika <asaf.mes...@gmail.com> wrote:
> > > > > > > > >>>
> > > > > > > > >>> Your feedback made me realize I need to add a "TL;DR" section, which I just added.
> > > > > > > > >>>
> > > > > > > > >>> I'm quoting it here. It gives a brief summary of the proposal, which requires up to 5 min of read time, helping you get a high-level picture before you dive into the background/motivation/solution.
> > > > > > > > >>>
> > > > > > > > >>> ----------------------
> > > > > > > > >>> TL;DR
> > > > > > > > >>>
> > > > > > > > >>> Working with Metrics today as a user or a developer is hard and has many severe issues.
> > > > > > > > >>>
> > > > > > > > >>> From the user perspective:
> > > > > > > > >>>
> > > > > > > > >>> - One of Pulsar's strongest features is "cheap" topics, so you can easily have 10k - 100k topics per broker. Once you do that, you quickly learn that the amount of metrics you export via "/metrics" (Prometheus style endpoint) becomes really big. The cost to store them becomes too high, queries time out, or even the "/metrics" endpoint itself times out.
> > > > > > > > >>> The only option Pulsar gives you today is all-or-nothing filtering and very crude aggregation. You switch metrics from topic aggregation level to namespace aggregation level. Also you can turn off producer and consumer level metrics. You end up doing it all, leaving you "blind", looking at the metrics from a namespace level which is too high level. You end up conjuring all kinds of scripts on top of the topic stats endpoint to glue together some aggregated metrics view for the topics you need.
> > > > > > > > >>> - Summaries (metric type giving you quantiles like p95), which are used in Pulsar, can't be aggregated across topics / brokers due to their inherent design.
> > > > > > > > >>> - Plugin authors spend too much time on defining and exposing metrics to Pulsar since the only interface Pulsar offers is writing your metrics yourself as UTF-8 bytes in Prometheus Text Format to a byte stream interface given to you.
> > > > > > > > >>> - Pulsar histograms are exported in a way that is not conformant with Prometheus, which means you can't get the p95 quantile on such histograms, making them very hard to use in day-to-day life.
> > > > > > > > >>
> > > > > > > > >> What version of DataSketches is used to produce the histogram? Is it still an old Yahoo one, or are we using an updated one from Apache DataSketches?
> > > > > > > > >>
> > > > > > > > >> Seems like this is a single PR/small PIP for 3.1?
> > > > > > > > >
> > > > > > > > > Histograms are a list of buckets, each is a counter.
> > > > > > > > > Summary is a collection of values collected over a time window, which at the end you get a calculation of the quantiles of those values: p95, p50, and those are exported from Pulsar.
> > > > > > > > >
> > > > > > > > > Pulsar histograms do not use DataSketches.
> > > > > > > >
> > > > > > > > Bookkeeper Metrics wraps Yahoo DataSketches last I checked.
> > > > > > > >
> > > > > > > > > They are just counters.
> > > > > > > > > They do not adhere to Prometheus since:
> > > > > > > > > a. The counter is expected to be cumulative, but Pulsar resets each bucket counter to 0 every 1 min
> > > > > > > > > b. The bucket upper range is expected to be written as an attribute "le" but today it is encoded in the name of the metric itself.
> > > > > > > > >
> > > > > > > > > This is a breaking change, hence hard to make in any small release.
> > > > > > > > > This is why it's part of this PIP since so many things will break, and all of them will break on a separate layer (OTel metrics), hence not break anyone without their consent.
> > > > > > > >
> > > > > > > > If this change will break existing Grafana dashboards and other operational monitoring already in place then it will break guarantees we have made about safely being able to downgrade from a bad upgrade.
> > > > > > > >
> > > > > > > > >>
> > > > > > > > >>> - Too many metrics are rates which also delta-reset every interval you configure in Pulsar and on restart, instead of relying on cumulative (ever growing) counters and letting Prometheus use its rate function.
> > > > > > > > >>> - and many more issues
> > > > > > > > >>>
> > > > > > > > >>> From the developer perspective:
> > > > > > > > >>>
> > > > > > > > >>> - There are 4 different ways to define and record metrics in Pulsar: Pulsar's own metrics library, Prometheus Java Client, Bookkeeper metrics library and plain native Java SDK objects (AtomicLong, ...). It's very confusing for the developer and creates inconsistencies for the end user (e.g. Summary is different in each).
> > > > > > > > >>> - Patching your metrics into the "/metrics" Prometheus endpoint is confusing, cumbersome and error-prone.
> > > > > > > > >>> - many more
> > > > > > > > >>>
> > > > > > > > >>> This proposal offers several key changes to solve that:
> > > > > > > > >>>
> > > > > > > > >>> - Cardinality (supporting 10k-100k topics per broker) is solved by introducing a new aggregation level for metrics called Topic Metric Group. Using configuration, you specify for each topic its group (using wildcard/regex). This allows you to "zoom" out to a more detailed granularity level like groups instead of namespaces, and since you control how many groups you'll have, this solves the cardinality issue without sacrificing level of detail too much.
> > > > > > > > >>> - Fine-grained, dynamic filtering mechanism. You'll have rule-based dynamic configuration, allowing you to specify per namespace/topic/group which metrics you'd like to keep/drop. Rules allow you to set the default to have a small amount of metrics at group and namespace level only and drop the rest. When needed, you can add an override rule to "open" up a certain group to have more metrics at higher granularity (topic or even consumer/producer level). Since it's dynamic you "open" such a group when you see it's misbehaving, see it at topic level, and when all is resolved, you can "close" it. A bit similar experience to logging levels in Log4j or Logback, where you set a default and override per class/package.
> > > > > > > > >>>
> > > > > > > > >>> Aggregation and Filtering combined solve the cardinality without sacrificing the level of detail when needed and, most importantly, you determine which topic/group/namespace it happens on.
> > > > > > > > >>>
> > > > > > > > >>> Since this change is so invasive, it requires a single metrics library to implement all of it on top of; hence the third big change point is consolidating all four ways to define and record metrics into a single, new one: OpenTelemetry Metrics (Java SDK, and also Python and Go for the Pulsar Function runners).
> > > > > > > > >>> Introducing OpenTelemetry (OTel) also solves the biggest pain point from the developer perspective, since it's a superb metrics library offering everything you need, and there is going to be a single way - only it. Also, it solves the robustness issue for plugin authors, who will use OpenTelemetry. It so happens that it also solves all the numerous problems described in the doc itself.
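
As a rough illustration of the two mechanisms described in the TL;DR (topic metric groups assigned by wildcard, and keep/drop rules with overrides), here is a minimal Python sketch. It is not the PIP's configuration format or API: the pattern syntax, the rule fields, the group names and every function name below are hypothetical, chosen only to show the idea.

```python
import fnmatch
from collections import defaultdict

# Hypothetical group configuration: the first matching wildcard pattern wins.
GROUP_PATTERNS = [
    ("persistent://tenant/orders/*", "orders"),
    ("persistent://tenant/billing/invoice-*", "invoices"),
]
DEFAULT_GROUP = "other"

# Hypothetical filter rules, evaluated in order; the last matching rule wins,
# so an override appended later can "open" a single group to topic level.
DEFAULT_RULES = [
    {"group": "*", "level": "group"},  # default: export group-level series only
]


def group_for(topic):
    """Return the metric group of a topic (first matching pattern)."""
    for pattern, group in GROUP_PATTERNS:
        if fnmatch.fnmatchcase(topic, pattern):
            return group
    return DEFAULT_GROUP


def level_for(group, rules):
    """Return the granularity ("group" or "topic") selected for a group."""
    level = "group"
    for rule in rules:
        if fnmatch.fnmatchcase(group, rule["group"]):
            level = rule["level"]
    return level


def export(per_topic_counts, rules):
    """Aggregate per-topic counters into the series that would be exported."""
    series = defaultdict(int)
    for topic, value in per_topic_counts.items():
        group = group_for(topic)
        if level_for(group, rules) == "topic":
            series[("topic", topic)] += value  # group is "open": keep detail
        else:
            series[("group", group)] += value  # collapse into one series
    return dict(series)


if __name__ == "__main__":
    counts = {
        "persistent://tenant/orders/a": 5,
        "persistent://tenant/orders/b": 7,
        "persistent://tenant/billing/invoice-1": 3,
    }
    print(export(counts, DEFAULT_RULES))
    # "Open" the orders group while investigating it, then drop the override.
    override = {"group": "orders", "level": "topic"}
    print(export(counts, DEFAULT_RULES + [override]))
```

The number of exported series is bounded by the number of groups plus whatever is temporarily "open", which is the cardinality argument made above.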
> > > > > > > > >>>
> > > > > > > > >>> The solution will be introduced as another layer with feature toggles, so you can work with the existing system and/or OTel, until gradually deprecating the existing system.
> > > > > > > > >>>
> > > > > > > > >>> It's a big breaking change for Pulsar users on many fronts: names, semantics, configuration. Read at the end of this doc to learn exactly what will change for the user (at a high level).
> > > > > > > > >>>
> > > > > > > > >>> In my opinion, it will make the Pulsar user experience so much better that they will want to migrate to it, despite the breaking change.
> > > > > > > > >>>
> > > > > > > > >>> This was a very short summary. You are most welcome to read the full design document below and express feedback, so we can make it better.
> > > > > > > > >>>
> > > > > > > > >>> On Sun, May 7, 2023 at 7:52 PM Asaf Mesika <asaf.mes...@gmail.com> wrote:
> > > > > > > > >>>
> > > > > > > > >>>> On Sun, May 7, 2023 at 4:23 PM Yunze Xu <y...@streamnative.io.invalid> wrote:
> > > > > > > > >>>>
> > > > > > > > >>>>> I was excited to learn much more about metrics when I started reading this proposal. But I became more and more frustrated when I found there was still too much content left even though I had already spent much time reading it. I'm wondering how much time you expected reviewers to spend reading through this proposal? I just recalled the discussion you started before [1]. Did you expect each PMC member that gives his/her +1 to read only parts of this proposal?
> > > > > > > > >>>>>
> > > > > > > > >>>>
> > > > > > > > >>>> I estimated around 2 hours needed for a reviewer.
> > > > > > > > >>>> I hate it being so long, but I simply couldn't find a way to downsize it more. Furthermore, I consulted with my colleagues including Matteo, but we couldn't see a way to scope it down.
> > > > > > > > >>>> Why? Because once you begin this journey, you need to know how it's going to end.
> > > > > > > > >>>> What I ended up doing is writing all the crucial details for review in the High Level Design section.
> > > > > > > > >>>> It's still a big, hefty section, but I don't think I can step out or let anyone else change Pulsar so invasively without the full extent of the change.
> > > > > > > > >>>>
> > > > > > > > >>>> I don't think it's wise to read parts.
> > > > > > > > >>>> I did my very best effort to minimize it, but the scope is simply big.
> > > > > > > > >>>> Open for suggestions, but it requires reading all the PIP :)
> > > > > > > > >>>>
> > > > > > > > >>>> Thanks a lot Yunze for dedicating any time to it.
> > > > > > > > >>>>
> > > > > > > > >>>>>
> > > > > > > > >>>>> Let's get back to the proposal. For now, what I mainly learned and am concerned about is:
> > > > > > > > >>>>> 1. Pulsar has many ways to expose metrics. It's not unified and confusing.
> > > > > > > > >>>>> 2. The current metrics system cannot support a large amount of topics.
> > > > > > > > >>>>> 3. It's hard for plugin authors to integrate metrics. (For example, KoP [2] integrates metrics by implementing the PrometheusRawMetricsProvider interface and it indeed needs much work)
> > > > > > > > >>>>>
> > > > > > > > >>>>> Regarding the 1st issue, this proposal chooses OpenTelemetry (OTel).
> > > > > > > > >>>>>
> > > > > > > > >>>>> Regarding the 2nd issue, I scrolled to the "Why OpenTelemetry?" section. It's still frustrating to see no answer. Eventually, I found
> > > > > > > > >>>>>
> > > > > > > > >>>>
> > > > > > > > >>>> OpenTelemetry isn't the solution for a large amount of topics.
> > > > > > > > >>>> The solution is described in the "Aggregate and Filtering to solve cardinality issues" section.
> > > > > > > > >>>>
> > > > > > > > >>>>> the explanation in the "What we need to fix in OpenTelemetry - Performance" section. It seems that we still need some enhancements in OTel. In other words, currently OTel is not ready for resolving all these issues listed in the proposal but we believe it will.
> > > > > > > > >>>>>
> > > > > > > > >>>>
> > > > > > > > >>>> Let me rephrase "believe" --> we work together with the maintainers to do it, yes.
> > > > > > > > >>>> I am open to any other suggestion.
> > > > > > > > >>>>
> > > > > > > > >>>>>
> > > > > > > > >>>>> As for the 3rd issue, from the "Integrating with Pulsar Plugins" section, the plugin authors still need to implement the new OTel interfaces. Is it much easier than using the existing ways to expose metrics? Could metrics still be easily integrated with Grafana?
> > > > > > > > >>>>>
> > > > > > > > >>>>
> > > > > > > > >>>> Yes, it's way easier.
> > > > > > > > >>>> Basically you have full-fledged metrics library objects: Meter, Gauge, Histogram, Counter.
> > > > > > > > >>>> No more Raw Metrics Provider, writing UTF-8 bytes in Prometheus format.
> > > > > > > > >>>> You get namespacing for free with Meter name and version.
> > > > > > > > >>>> It's way better than the current solution and any other library.
> > > > > > > > >>>>
> > > > > > > > >>>>>
> > > > > > > > >>>>> That's all I am concerned about at the moment. I understand, and appreciate that you've spent much time studying and explaining all these things. But, this proposal is still too huge.
> > > > > > > > >>>>>
> > > > > > > > >>>>
> > > > > > > > >>>> I appreciate your effort a lot!
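
To make "writing UTF-8 bytes in Prometheus format" concrete, and to show the histogram convention raised earlier in the thread (cumulative bucket counters, upper bound carried in an `le` label rather than in the metric name), here is a minimal hand-rolled sketch in Python. It is neither Pulsar code nor the OTel API; the function name, metric name and bucket bounds are made up for illustration.

```python
def render_histogram(name, bounds, bucket_counts, total_sum):
    """Render one histogram in Prometheus text exposition format.

    bounds: sorted upper bounds of the finite buckets.
    bucket_counts: per-bucket (non-cumulative) counts accumulated since
        process start, with one extra trailing entry for the overflow bucket.
    total_sum: sum of all observed values.
    """
    if len(bucket_counts) != len(bounds) + 1:
        raise ValueError("need one count per bound plus an overflow count")
    lines = [f"# TYPE {name} histogram"]
    running = 0
    for bound, count in zip(bounds, bucket_counts):
        running += count  # each bucket also counts everything below it
        lines.append(f'{name}_bucket{{le="{bound}"}} {running}')
    running += bucket_counts[-1]
    lines.append(f'{name}_bucket{{le="+Inf"}} {running}')
    lines.append(f"{name}_sum {total_sum}")
    lines.append(f"{name}_count {running}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical metric: 4 observations <= 1, 3 in (1, 5], 2 in (5, 10], 1 above.
    print(render_histogram("publish_latency_ms", [1, 5, 10], [4, 3, 2, 1], 42.5), end="")
```

With buckets shaped like this and never reset, quantiles can be computed on the Prometheus side, for example with `histogram_quantile(0.95, sum by (le) (rate(publish_latency_ms_bucket[5m])))`. A metrics library produces this output for you, which is the point being made about not hand-writing the format.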
> > > > > > > > >>>>
> > > > > > > > >>>>>
> > > > > > > > >>>>> [1] https://lists.apache.org/thread/04jxqskcwwzdyfghkv4zstxxmzn154kf
> > > > > > > > >>>>> [2] https://github.com/streamnative/kop/blob/master/kafka-impl/src/main/java/io/streamnative/pulsar/handlers/kop/stats/PrometheusMetricsProvider.java
> > > > > > > > >>>>>
> > > > > > > > >>>>> Thanks,
> > > > > > > > >>>>> Yunze
> > > > > > > > >>>>>
> > > > > > > > >>>>> On Sun, May 7, 2023 at 5:53 PM Asaf Mesika <asaf.mes...@gmail.com> wrote:
> > > > > > > > >>>>>>
> > > > > > > > >>>>>> I'm very appreciative of feedback from multiple Pulsar users and devs on this PIP, since it has dramatic changes suggested and quite extensive positive change for the users.
> > > > > > > > >>>>>>
> > > > > > > > >>>>>> On Thu, Apr 27, 2023 at 7:32 PM Asaf Mesika <asaf.mes...@gmail.com> wrote:
> > > > > > > > >>>>>>
> > > > > > > > >>>>>>> Hi all,
> > > > > > > > >>>>>>>
> > > > > > > > >>>>>>> I'm very excited to release a PIP I've been working on in the past 11 months, which I think will be immensely valuable to Pulsar, which I like so much.
> > > > > > > > >>>>>>>
> > > > > > > > >>>>>>> PIP: https://github.com/apache/pulsar/issues/20197
> > > > > > > > >>>>>>>
> > > > > > > > >>>>>>> I'm quoting here the preface:
> > > > > > > > >>>>>>>
> > > > > > > > >>>>>>> === QUOTE START ===
> > > > > > > > >>>>>>>
> > > > > > > > >>>>>>> Roughly 11 months ago, I started working on solving the biggest issue with Pulsar metrics: the lack of ability to monitor a Pulsar broker with a large topic count: 10k, 100k, and future support of 1M. This started by mapping the existing functionality and then enumerating all the problems I saw (all documented in this doc <https://docs.google.com/document/d/1vke4w1nt7EEgOvEerPEUS-Al3aqLTm9cl2wTBkKNXUA/edit?usp=sharing>).
> > > > > > > >
> > > > > > > > I thought we were going to stop using Google docs for PIPs.
> > > > > > > >
> > > > > > > > >>>>>>>
> > > > > > > > >>>>>>> This PIP is a parent PIP. It aims to gradually solve (using sub-PIPs) all the current metric system's problems and provide the ability to monitor a broker with a large topic count, which is currently lacking. As a parent PIP, it will describe each problem and its solution at a high level, leaving fine-grained details to the sub-PIPs. The parent PIP ensures all solutions align and do not contradict each other.
> > > > > > > > >>>>>>>
> > > > > > > > >>>>>>> The basic building block to solve the monitoring ability of large topic count is aggregating internally (to topic groups) and adding fine-grained filtering. We could have shoe-horned it into the existing metric system, but we thought adding that to a system already ingrained with many problems would be wrong and hard to do gradually, as so many things will break. This is why the second-biggest design decision presented here is consolidating all existing metric libraries into a single one - OpenTelemetry <https://opentelemetry.io/>. The parent PIP will explain why OpenTelemetry was chosen out of existing solutions and why it far exceeds all other options. I’ve been working closely with the OpenTelemetry community in the past eight months: brainstorming this integration, and raising issues, in an effort to remove serious blockers to make this migration successful.
> > > > > > > > >>>>>>>
> > > > > > > > >>>>>>> I made every effort to summarize this document so that it can be concise yet clear. I understand it is an effort to read it and, more so, provide meaningful feedback on such a large document; hence I’m very grateful for each individual who does so.
> > > > > > > > >>>>>>>
> > > > > > > > >>>>>>> I think this design will help improve the user experience immensely, so it is worth the time spent reading it.
> > > > > > > > >>>>>>>
> > > > > > > > >>>>>>> === QUOTE END ===
> > > > > > > > >>>>>>>
> > > > > > > > >>>>>>> Thanks!
> > > > > > > > >>>>>>>
> > > > > > > > >>>>>>> Asaf Mesika