> Thanks for the details, Devin. Curious - 'We' stands for which company?

What do you mean? I was quoting Rajan when I said, "we."

Devin G. Bost

On Wed, Jun 14, 2023 at 10:02 AM Asaf Mesika <asaf.mes...@gmail.com> wrote:

> Thanks for the details, Devin. Curious - 'We' stands for which company?
>
> Can you take a look at my previous response to see if it answers the concern you raised?
>
> Thanks!
>
> On Wed, Jun 14, 2023 at 1:49 PM Devin Bost <devin.b...@gmail.com> wrote:
>
> > > Hi,
> > >
> > > Are we proposing a change to break existing metrics compatibility (Prometheus)? If that is the case then it's a big red flag as it will be a pain for any company to upgrade Pulsar as monitoring is THE most important part of the system and we don't even want to break compatibility for any small things to avoid interruption for users that are using the Pulsar system. I think it's always good to enhance a system by maintaining compatibility and I would be fine if we can introduce a new metrics API without causing ANY interruption to the existing metrics API. But if we can't maintain compatibility then it's a big red flag and not acceptable for the Pulsar community.
> >
> > Proposing a large breaking change (even if it's crucial) is the single fastest way to motivate your users to migrate to a different platform. I wish it wasn't the case, but it's the cold reality.
> >
> > With that said, I'm a big proponent of OpenTelemetry. I did a big video a while back that some of you may remember on the use of OpenTracing (before it was merged into OpenTelemetry). OpenTelemetry has gained considerable momentum in the industry since then.
> >
> > I'm also very interested in a solution to the metrics problem. I've run into the scalability issues with metrics in production, and I've been very concerned about the metrics bottlenecks around our ability to deliver our promises around supporting large numbers of topics. One of the big advantages of Pulsar over Kafka is supposed to be that topics are cheap, but as it stands, our current metrics design gets seriously in the way of that. Generally speaking, I'm open to solutions, especially if they align us with a growing industry standard.
> >
> > - Devin
> >
> > On Wed, Jun 14, 2023, 3:28 AM Enrico Olivelli <eolive...@gmail.com> wrote:
> >
> > > On Wed, Jun 14, 2023, 04:33 Rajan Dhabalia <rdhaba...@apache.org> wrote:
> > >
> > > > Hi,
> > > >
> > > > Are we proposing a change to break existing metrics compatibility (Prometheus)? If that is the case then it's a big red flag as it will be a pain for any company to upgrade Pulsar as monitoring is THE most important part of the system and we don't even want to break compatibility for any small things to avoid interruption for users that are using the Pulsar system. I think it's always good to enhance a system by maintaining compatibility and I would be fine if we can introduce a new metrics API without causing ANY interruption to the existing metrics API. But if we can't maintain compatibility then it's a big red flag and not acceptable for the Pulsar community.
> > >
> > > I agree.
> > >
> > > If it is possible to export data in a way that is compatible with Prometheus without adding too much overhead then I would support this work.
> > >
> > > About renaming the metrics: we can do it only if the changes for users are as trivial as replacing the queries in the Grafana dashboard or in alerting systems.
> > >
> > > Asaf, do you have a prototype? Built over any version of Pulsar?
> > >
> > > Also, it would be very useful to start an initiative to collect the list of metrics that people really use in production, especially for automated alerts.
> > >
> > > In my experience you usually care about:
> > > - in/out traffic (rates, bytes...)
> > > - number of producers, consumers, topics, subscriptions...
> > > - backlog
> > > - jvm metrics
> > > - function custom metrics
> > >
> > > Enrico
> > >
> > > > Thanks,
> > > > Rajan
> > > >
> > > > On Sun, May 21, 2023 at 9:01 AM Asaf Mesika <asaf.mes...@gmail.com> wrote:
> > > >
> > > > > Thanks for the reply, Enrico.
> > > > > Completely agree.
> > > > > This made me realize my TL;DR wasn't talking about export.
> > > > > I added this to it:
> > > > >
> > > > > ---
> > > > > Pulsar OTel Metrics will support exporting as Prometheus HTTP endpoint (`/metrics` but different port) for backward compatibility and also OTLP, so you can push the metrics to OTel Collector and from there ship it to any destination.
> > > > > ---
> > > > >
> > > > > OTel supports two kinds of exporter: Prometheus (HTTP) and OTLP (push).
> > > > > We'll just configure to use them.
> > > > >
> > > > > On Mon, May 15, 2023 at 10:35 AM Enrico Olivelli <eolive...@gmail.com> wrote:
> > > > >
> > > > > > Asaf,
> > > > > > thanks for contributing in this area.
> > > > > > Metrics are a fundamental feature of Pulsar.
> > > > > >
> > > > > > Currently I find it very awkward to maintain metrics, and also I see it as a problem to support only Prometheus.
> > > > > >
> > > > > > Regarding your proposal, IIRC in the past someone else proposed to support other metrics systems and they were advised to use a sidecar approach, that is to add something next to Pulsar services that served the metrics in the preferred format/way.
> > > > > > I find that the sidecar approach is too inefficient and I am not proposing it (but I wanted to add this reference for the benefit of new people on the list).
> > > > > >
> > > > > > I wonder if it would be possible to keep compatibility with the current Prometheus based metrics.
> > > > > > Now Pulsar has reached a point in which it is widely used by many companies and also with big clusters; telling people that they have to rework all the infrastructure related to metrics, because we don't support Prometheus anymore or because we radically changed the way we publish metrics, is a step that seems too hard from my point of view.
> > > > > >
> > > > > > Currently I believe that compatibility is more important than versatility, and if we want to introduce new (and far better) features we must take it into account.
> > > > > >
> > > > > > So my point is that I generally support the idea of opening the way to OpenTelemetry, but we must have a way to not force all of our users to throw away their alerting systems, dashboards and know-how in troubleshooting Pulsar problems in production and dev.
> > > > > >
> > > > > > Best regards
> > > > > > Enrico
> > > > > >
> > > > > > On Mon, May 15, 2023 at 02:17 Dave Fisher <wave4d...@comcast.net> wrote:
> > > > > > >
> > > > > > > > On May 10, 2023, at 1:01 AM, Asaf Mesika <asaf.mes...@gmail.com> wrote:
> > > > > > > >
> > > > > > > > On Tue, May 9, 2023 at 11:29 PM Dave Fisher <w...@apache.org> wrote:
> > > > > > > >
> > > > > > > >>>> On May 8, 2023, at 2:49 AM, Asaf Mesika <asaf.mes...@gmail.com> wrote:
> > > > > > > >>>
> > > > > > > >>> Your feedback made me realize I need to add a "TL;DR" section, which I just added.
> > > > > > > >>>
> > > > > > > >>> I'm quoting it here. It gives a brief summary of the proposal, which requires up to 5 min of read time, helping you get a high level picture before you dive into the background/motivation/solution.
> > > > > > > >>>
> > > > > > > >>> ----------------------
> > > > > > > >>> TL;DR
> > > > > > > >>>
> > > > > > > >>> Working with Metrics today as a user or a developer is hard and has many severe issues.
> > > > > > > >>>
> > > > > > > >>> From the user perspective:
> > > > > > > >>>
> > > > > > > >>> - One of Pulsar's strongest features is "cheap" topics so you can easily have 10k - 100k topics per broker. Once you do that, you quickly learn that the amount of metrics you export via "/metrics" (Prometheus style endpoint) becomes really big. The cost to store them becomes too high, queries time out or even the "/metrics" endpoint itself times out. The only option Pulsar gives you today is all-or-nothing filtering and very crude aggregation. You switch metrics from topic aggregation level to namespace aggregation level. Also you can turn off producer and consumer level metrics. You end up doing it all leaving you "blind", looking at the metrics from a namespace level which is too high level. You end up conjuring all kinds of scripts on top of topic stats endpoint to glue some aggregated metrics view for the topics you need.
> > > > > > > >>> - Summaries (metric type giving you quantiles like p95) which are used in Pulsar, can't be aggregated across topics / brokers due to their inherent design.
> > > > > > > >>> - Plugin authors spend too much time on defining and exposing metrics to Pulsar since the only interface Pulsar offers is writing your metrics by yourself as UTF-8 bytes in Prometheus Text Format to a byte stream interface given to you.
> > > > > > > >>> - Pulsar histograms are exported in a way that is not conformant with Prometheus, which means you can't get the p95 quantile on such histograms, making them very hard to use in day to day life.
> > > > > > > >>
> > > > > > > >> What version of DataSketches is used to produce the histogram? Is it still an old Yahoo one, or are we using an updated one from Apache DataSketches?
> > > > > > > >>
> > > > > > > >> Seems like this is a single PR/small PIP for 3.1?
> > > > > > > >
> > > > > > > > Histograms are a list of buckets, each is a counter.
> > > > > > > > Summary is a collection of values collected over a time window, which at the end you get a calculation of the quantiles of those values: p95, p50, and those are exported from Pulsar.
> > > > > > > >
> > > > > > > > Pulsar histograms do not use Data Sketches.
> > > > > > >
> > > > > > > Bookkeeper Metrics wraps Yahoo DataSketches last I checked.
> > > > > > >
> > > > > > > > They are just counters.
> > > > > > > > They do not adhere to Prometheus since:
> > > > > > > > a. The counter is expected to be cumulative, but Pulsar resets each bucket counter to 0 every 1 min
> > > > > > > > b. The bucket upper range is expected to be written as an attribute "le" but today it is encoded in the name of the metric itself.
> > > > > > > >
> > > > > > > > This is a breaking change, hence hard to mark in any small release.
> > > > > > > > This is why it's part of this PIP since so many things will break, and all of them will break on a separate layer (OTel metrics), hence not break anyone without their consent.
> > > > > > >
> > > > > > > If this change will break existing Grafana dashboards and other operational monitoring already in place then it will break guarantees we have made about safely being able to downgrade from a bad upgrade.
> > > > > > >
> > > > > > > >>> - Too many metrics are rates which also delta reset every interval you configure in Pulsar and restart, instead of relying on cumulative (ever growing) counters and letting Prometheus use its rate function.
> > > > > > > >>> - and many more issues
> > > > > > > >>>
> > > > > > > >>> From the developer perspective:
> > > > > > > >>>
> > > > > > > >>> - There are 4 different ways to define and record metrics in Pulsar: Pulsar's own metrics library, Prometheus Java Client, Bookkeeper metrics library and plain native Java SDK objects (AtomicLong, ...). It's very confusing for the developer and creates inconsistencies for the end user (e.g. Summary is different in each).
> > > > > > > >>> - Patching your metrics into the "/metrics" Prometheus endpoint is confusing, cumbersome and error prone.
> > > > > > > >>> - many more
> > > > > > > >>>
> > > > > > > >>> This proposal offers several key changes to solve that:
> > > > > > > >>>
> > > > > > > >>> - Cardinality (supporting 10k-100k topics per broker) is solved by introducing a new aggregation level for metrics called Topic Metric Group. Using configuration, you specify for each topic its group (using wildcard/regex). This allows you to "zoom" out to a more detailed granularity level like groups instead of namespaces, which you control how many groups you'll have hence solving the cardinality issue, without sacrificing level of detail too much.
> > > > > > > >>> - Fine-grained filtering mechanism, dynamic. You'll have rule-based dynamic configuration, allowing you to specify per namespace/topic/group which metrics you'd like to keep/drop. Rules allow you to set the default to have a small amount of metrics in group and namespace level only and drop the rest. When needed, you can add an override rule to "open" up a certain group to have more metrics in higher granularity (topic or even consumer/producer level). Since it's dynamic you "open" such a group when you see it's misbehaving, see it in topic level, and when all resolved, you can "close" it. A bit similar experience to logging levels in Log4j or Logback, that you default and override per class/package.
> > > > > > > >>>
> > > > > > > >>> Aggregation and Filtering combined solve the cardinality without sacrificing the level of detail when needed and most importantly, you determine which topic/group/namespace it happens on.
> > > > > > > >>>
> > > > > > > >>> Since this change is so invasive, it requires a single metrics library to implement all of it on top of; hence the third big change point is consolidating all four ways to define and record metrics to a single one, a new one: OpenTelemetry Metrics (Java SDK, and also Python and Go for the Pulsar Function runners).
> > > > > > > >>> Introducing OpenTelemetry (OTel) solves also the biggest pain point from the developer perspective, since it's a superb metrics library offering everything you need, and there is going to be a single way - only it. Also, it solves the robustness for Plugin authors, who will use OpenTelemetry. It so happens that it also solves all the numerous problems described in the doc itself.
> > > > > > > >>>
> > > > > > > >>> The solution will be introduced as another layer with feature toggles, so you can work with existing system, and/or OTel, until gradually deprecating existing system.
> > > > > > > >>>
> > > > > > > >>> It's a big breaking change for Pulsar users on many fronts: names, semantics, configuration. Read at the end of this doc to learn exactly what will change for the user (in high level).
> > > > > > > >>>
> > > > > > > >>> In my opinion, it will make Pulsar user experience so much better, they will want to migrate to it, despite the breaking change.
> > > > > > > >>>
> > > > > > > >>> This was a very short summary. You are most welcome to read the full design document below and express feedback, so we can make it better.
> > > > > > > >>>
> > > > > > > >>> On Sun, May 7, 2023 at 7:52 PM Asaf Mesika <asaf.mes...@gmail.com> wrote:
> > > > > > > >>>
> > > > > > > >>>> On Sun, May 7, 2023 at 4:23 PM Yunze Xu <y...@streamnative.io.invalid> wrote:
> > > > > > > >>>>
> > > > > > > >>>>> I'm excited to learn much more about metrics when I started reading this proposal. But I became more and more frustrated when I found there is still too much content left even if I've already spent much time reading this proposal. I'm wondering how much time did you expect reviewers to read through this proposal? I just recalled the discussion you started before [1]. Did you expect each PMC member that gives his/her +1 to read only parts of this proposal?
> > > > > > > >>>>
> > > > > > > >>>> I estimated around 2 hours needed for a reviewer.
> > > > > > > >>>> I hate it being so long, but I simply couldn't find a way to downsize it more. Furthermore, I consulted with my colleagues including Matteo, but we couldn't see a way to scope it down.
> > > > > > > >>>> Why? Because once you begin this journey, you need to know how it's going to end.
> > > > > > > >>>> What I ended up doing, is writing all the crucial details for review in the High Level Design section.
> > > > > > > >>>> It's still a big, hefty section, but I don't think I can step out or let anyone else change Pulsar so invasively without the full extent of the change.
> > > > > > > >>>>
> > > > > > > >>>> I don't think it's wise to read parts.
> > > > > > > >>>> I did my very best effort to minimize it, but the scope is simply big.
> > > > > > > >>>> Open for suggestions, but it requires reading all the PIP :)
> > > > > > > >>>>
> > > > > > > >>>> Thanks a lot Yunze for dedicating any time to it.
> > > > > > > >>>>
> > > > > > > >>>>> Let's talk back to the proposal, for now, what I mainly learned and are concerned about mostly are:
> > > > > > > >>>>> 1. Pulsar has many ways to expose metrics. It's not unified and confusing.
> > > > > > > >>>>> 2. The current metrics system cannot support a large amount of topics.
> > > > > > > >>>>> 3. It's hard for plugin authors to integrate metrics. (For example, KoP [2] integrates metrics by implementing the PrometheusRawMetricsProvider interface and it indeed needs much work)
> > > > > > > >>>>>
> > > > > > > >>>>> Regarding the 1st issue, this proposal chooses OpenTelemetry (OTel).
> > > > > > > >>>>>
> > > > > > > >>>>> Regarding the 2nd issue, I scrolled to the "Why OpenTelemetry?" section. It's still frustrating to see no answer. Eventually, I found
> > > > > > > >>>>
> > > > > > > >>>> OpenTelemetry isn't the solution for large amount of topics.
> > > > > > > >>>> The solution is described at "Aggregate and Filtering to solve cardinality issues" section.
> > > > > > > >>>>
> > > > > > > >>>>> the explanation in the "What we need to fix in OpenTelemetry - Performance" section. It seems that we still need some enhancements in OTel. In other words, currently OTel is not ready for resolving all these issues listed in the proposal but we believe it will.
> > > > > > > >>>>
> > > > > > > >>>> Let me rephrase "believe" --> we work together with the maintainers to do it, yes.
> > > > > > > >>>> I am open for any other suggestion.
> > > > > > > >>>>
> > > > > > > >>>>> As for the 3rd issue, from the "Integrating with Pulsar Plugins" section, the plugin authors still need to implement the new OTel interfaces. Is it much easier than using the existing ways to expose metrics? Could metrics still be easily integrated with Grafana?
> > > > > > > >>>>
> > > > > > > >>>> Yes, it's way easier.
> > > > > > > >>>> Basically you have a full fledged metrics library objects: Meter, Gauge, Histogram, Counter.
> > > > > > > >>>> No more Raw Metrics Provider, writing UTF-8 bytes in Prometheus format.
> > > > > > > >>>> You get namespacing for free with Meter name and version.
> > > > > > > >>>> It's way better than current solution and any other library.
> > > > > > > >>>>
> > > > > > > >>>>> That's all I am concerned about at the moment. I understand, and appreciate that you've spent much time studying and explaining all these things. But, this proposal is still too huge.
> > > > > > > >>>>
> > > > > > > >>>> I appreciate your effort a lot!
> > > > > > > >>>>
> > > > > > > >>>>> [1] https://lists.apache.org/thread/04jxqskcwwzdyfghkv4zstxxmzn154kf
> > > > > > > >>>>> [2] https://github.com/streamnative/kop/blob/master/kafka-impl/src/main/java/io/streamnative/pulsar/handlers/kop/stats/PrometheusMetricsProvider.java
> > > > > > > >>>>>
> > > > > > > >>>>> Thanks,
> > > > > > > >>>>> Yunze
> > > > > > > >>>>>
> > > > > > > >>>>> On Sun, May 7, 2023 at 5:53 PM Asaf Mesika <asaf.mes...@gmail.com> wrote:
> > > > > > > >>>>>>
> > > > > > > >>>>>> I'm very appreciative for feedback from multiple pulsar users and devs on this PIP, since it has dramatic changes suggested and quite extensive positive change for the users.
> > > > > > > >>>>>>
> > > > > > > >>>>>> On Thu, Apr 27, 2023 at 7:32 PM Asaf Mesika <asaf.mes...@gmail.com> wrote:
> > > > > > > >>>>>>
> > > > > > > >>>>>>> Hi all,
> > > > > > > >>>>>>>
> > > > > > > >>>>>>> I'm very excited to release a PIP I've been working on in the past 11 months, which I think will be immensely valuable to Pulsar, which I like so much.
> > > > > > > >>>>>>>
> > > > > > > >>>>>>> PIP: https://github.com/apache/pulsar/issues/20197
> > > > > > > >>>>>>>
> > > > > > > >>>>>>> I'm quoting here the preface:
> > > > > > > >>>>>>>
> > > > > > > >>>>>>> === QUOTE START ===
> > > > > > > >>>>>>>
> > > > > > > >>>>>>> Roughly 11 months ago, I started working on solving the biggest issue with Pulsar metrics: the lack of ability to monitor a pulsar broker with a large topic count: 10k, 100k, and future support of 1M. This started by mapping the existing functionality and then enumerating all the problems I saw (all documented in this doc
> > > > > > > >>>>>>> <https://docs.google.com/document/d/1vke4w1nt7EEgOvEerPEUS-Al3aqLTm9cl2wTBkKNXUA/edit?usp=sharing>
> > > > > > >
> > > > > > > I thought we were going to stop using Google docs for PIPs.
> > > > > > >
> > > > > > > >>>>>>> ).
> > > > > > > >>>>>>>
> > > > > > > >>>>>>> This PIP is a parent PIP. It aims to gradually solve (using sub-PIPs) all the current metric system's problems and provide the ability to monitor a broker with a large topic count, which is currently lacking. As a parent PIP, it will describe each problem and its solution at a high level, leaving fine-grained details to the sub-PIPs. The parent PIP ensures all solutions align and do not contradict each other.
> > > > > > > >>>>>>>
> > > > > > > >>>>>>> The basic building block to solve the monitoring ability of large topic count is aggregating internally (to topic groups) and adding fine-grained filtering. We could have shoe-horned it into the existing metric system, but we thought adding that to a system already ingrained with many problems would be wrong and hard to do gradually, as so many things will break. This is why the second-biggest design decision presented here is consolidating all existing metric libraries into a single one - OpenTelemetry <https://opentelemetry.io/>. The parent PIP will explain why OpenTelemetry was chosen out of existing solutions and why it far exceeds all other options. I’ve been working closely with the OpenTelemetry community in the past eight months: brain-storming this integration, and raising issues, in an effort to remove serious blockers to make this migration successful.
> > > > > > > >>>>>>>
> > > > > > > >>>>>>> I made every effort to summarize this document so that it can be concise yet clear. I understand it is an effort to read it and, more so, provide meaningful feedback on such a large document; hence I’m very grateful for each individual who does so.
> > > > > > > >>>>>>>
> > > > > > > >>>>>>> I think this design will help improve the user experience immensely, so it is worth the time spent reading it.
> > > > > > > >>>>>>>
> > > > > > > >>>>>>> === QUOTE END ===
> > > > > > > >>>>>>>
> > > > > > > >>>>>>> Thanks!
> > > > > > > >>>>>>>
> > > > > > > >>>>>>> Asaf Mesika
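
For readers following the histogram point raised in the quoted discussion (bucket counters are expected to be cumulative rather than reset every minute, and the bucket upper bound is expected in an "le" label rather than in the metric name), the idea can be sketched in a few lines of Python. This is only an illustration, not Pulsar or OpenTelemetry code: the class name, the metric name and the bucket bounds are made up, and the `_sum` series that a real Prometheus histogram also exposes is left out.

```python
from itertools import accumulate


class CumulativeHistogram:
    """Hypothetical helper: turns per-interval bucket counts into
    Prometheus-style cumulative buckets carrying an "le" label."""

    def __init__(self, name, upper_bounds):
        self.name = name
        # Finite bucket upper bounds, ascending. One extra overflow
        # bucket (+Inf) is tracked after them.
        self.upper_bounds = list(upper_bounds)
        self.totals = [0] * (len(self.upper_bounds) + 1)

    def add_interval(self, interval_counts):
        """Add one interval's per-bucket counts; totals are never reset."""
        if len(interval_counts) != len(self.totals):
            raise ValueError("expected one count per bucket, plus overflow")
        for i, count in enumerate(interval_counts):
            self.totals[i] += count

    def render(self):
        """Return text lines in the Prometheus exposition style."""
        # Each "le" bucket counts every observation <= its bound, so the
        # per-bucket totals are summed from the smallest bucket upward.
        running = list(accumulate(self.totals))
        labels = [format(b, "g") for b in self.upper_bounds] + ["+Inf"]
        lines = [
            f'{self.name}_bucket{{le="{le}"}} {n}'
            for le, n in zip(labels, running)
        ]
        lines.append(f"{self.name}_count {running[-1]}")
        return lines


if __name__ == "__main__":
    hist = CumulativeHistogram("demo_latency_ms", [1, 5])
    hist.add_interval([2, 3, 1])  # first minute
    hist.add_interval([1, 0, 0])  # second minute: added, not reset
    print("\n".join(hist.render()))
```

Because the totals only ever grow, a Prometheus-style rate or quantile calculation over the "le" buckets stays meaningful across scrapes, which is the property the thread says the current per-minute reset breaks.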