Hi -

> On Sep 24, 2024, at 11:01 AM, Heesung Sohn <hees...@apache.org> wrote:
>
> Hi all,
>
> I realized that sampling outlier topic/subscription/consumer/producer metrics can be useful for operating Pulsar.
>
> I noticed that Pulsar's performance issues are mostly caused by outlier topics/subscriptions/consumers/producers.
>
> For example, disk usage can grow a lot when consumers are very slow (0 ack rate and 0 permits) while producers publish messages at a high rate to the same topic.
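>
> For illustration, such a scan could be sketched with the admin client - rough code with an arbitrary backlog threshold:
>
>     import org.apache.pulsar.client.admin.PulsarAdmin;
>     import org.apache.pulsar.client.admin.PulsarAdminException;
>     import org.apache.pulsar.common.policies.data.TopicStats;
>
>     void findOutliers(PulsarAdmin admin) throws PulsarAdminException {
>         for (String topic : admin.topics().getList("public/default")) {
>             TopicStats stats = admin.topics().getStats(topic);
>             // Flag subscriptions that consume nothing while a backlog keeps growing.
>             stats.getSubscriptions().forEach((name, sub) -> {
>                 if (sub.getMsgRateOut() == 0 && sub.getMsgBacklog() > 100_000) {
>                     System.out.println("Outlier: " + topic + "/" + name + " backlog=" + sub.getMsgBacklog());
>                 }
>             });
>         }
>     }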

This is something you could observe while running an OpenMessaging Benchmark that creates and then drains a backlog.

> I think sampling such outlier metrics can be useful to monitor Pulsar.
> Do we think this is a separate issue and deserves another PIP, or should it be part of this PIP-264 work?

Even if it were not part of PIP-264, what about "outlier" metrics would require a new PIP? Perhaps a better way of understanding such outliers is to take flame graphs of the brokers and bookkeepers you are testing with the OMB?

Best,
Dave

> Thanks,
> Heesung
>
> On Mon, Aug 28, 2023 at 7:48 AM Asaf Mesika <asaf.mes...@gmail.com> wrote:
>>
>> I've relocated the PIP content from the issue (https://github.com/apache/pulsar/issues/20197) to a PR (https://github.com/apache/pulsar/pull/21080) so I could add a TOC and also be in line with the new process.
>>
>> On Mon, Aug 28, 2023 at 5:46 PM Asaf Mesika <asaf.mes...@gmail.com> wrote:
>>
>>> Thanks for taking the time to review the document - *highly appreciated*. I've inlined my comments below.
>>>
>>> On Mon, Aug 21, 2023 at 12:19 PM Hang Chen <chenh...@apache.org> wrote:
>>>
>>>> Hi Asaf,
>>>> Thanks for bringing up this great proposal.
>>>>
>>>> After reading this proposal, I have the following questions.
>>>> 1. This proposal will introduce a big breaking change in Pulsar, especially from the code perspective. I'm interested in how we can support both the old and new implementations at the same time, step by step.
>>>>
>>>>> We will keep the current metric system as is, and add a new layer of metrics using the OpenTelemetry Java SDK. All of Pulsar's metrics will also be created using OpenTelemetry. A feature flag will allow enabling OpenTelemetry metrics (init, recording and exporting). All the features and changes described here will be done only in the OpenTelemetry layer, allowing the old version to keep working until you're ready to switch to the OTel (OpenTelemetry) implementation. In the far future, once OTel usage has stabilized and become widely adopted, we'll deprecate the current metric system and eventually remove it. We will also make sure there is a feature flag to turn off the current Prometheus-based metric system.
>>>
>>> Current metrics code remains as is, untouched.
>>> I'm adding new code, using the OpenTelemetry API and SDK. The code in most cases will read the existing variables (like msgsReceived), and in other cases will set up its own new objects like Counter and Histogram and *also* record values to them.
>>> You can take a look at the revised PIP, as I've added a tiny code sample to give an idea of how it will look. Look here <https://github.com/apache/pulsar/blob/6ec0bde4127a54ab8e8bb67fb091c932fa2952a4/pip/pip-264.md#consolidating-to-opentelemetry>.
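>>>
>>> In miniature, the idea is roughly this (the instrument name and attribute are illustrative, not final):
>>>
>>>     import io.opentelemetry.api.common.AttributeKey;
>>>     import io.opentelemetry.api.common.Attributes;
>>>     import io.opentelemetry.api.metrics.LongCounter;
>>>     import io.opentelemetry.api.metrics.Meter;
>>>
>>>     Meter meter = openTelemetry.getMeter("org.apache.pulsar.broker");
>>>     LongCounter messagesIn = meter.counterBuilder("pulsar.broker.messages.in")
>>>             .setUnit("{message}")
>>>             .setDescription("Messages received by the broker")
>>>             .build();
>>>     // In the publish path, record next to the existing counter update:
>>>     messagesIn.add(1, Attributes.of(AttributeKey.stringKey("topic"), topicName));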
>>>
>>>> 2. We introduced group and filter logic in the metric system, and I have the following concerns.
>>>> - We need to add protection logic or pre-validation for the group and filter rules, to avoid misconfigured rules causing a huge performance impact on Pulsar brokers.
>>>
>>> Good call. I've added a note in the PIP that we will reject any filter-rules update if the expected number of data points exceeds a certain threshold. I left this as a detail to be specified in the sub-PIP.
>>>
>>>> - We need to support exposing all the topic-level metrics when the Pulsar cluster has just thousands of topics.
>>>
>>> I've added a new goal: "New system should support the maximum supported number of topics in the current system (i.e. 4k topics) without filtering".
>>>
>>>> - Even though we introduced groups and filters for the metrics, we still can't resolve the large number of metrics exposed to Prometheus. Exposing a large amount of data (100MB+) through an HTTP endpoint is ineffective. We can consider exposing those metric data via a Pulsar topic and developing a Pulsar-to-Prometheus connector that writes Pulsar metric data to Prometheus in streaming mode instead of batch mode, to reduce the performance impact.
>>>
>>> As I wrote in my PIP: if you find yourself exporting 100MB or more of metric data *every* 30 seconds, you will suffer from:
>>> * the high cost of the TSDB holding it (e.g. Prometheus, Cortex, VictoriaMetrics)
>>> * query timeouts, since there is too much data to read
>>>
>>> Also, the bottleneck is not transfer time over the wire. It's mostly the memory needed by any TSDB to hold the data for at least 2 hours before flushing it to disk - this is the most expensive part of all.
>>>
>>> At a 100MB response size, filtering and grouping are a must.
>>>
>>>> - Group and filter logic uses regular expressions extensively in rules. Regular-expression parsing and matching are CPU- and time-intensive operations. We have a push-down filter to reduce the number of generated metrics, but that still can't solve the regular-expression matching issues. If the user provides a complex regular expression for a group and filter rule, the metric-generating thread will be the bottleneck and will block other threads if we use synchronous calls.
>>>
>>> I plan to use caching, as written in the PIP. Roughly (instrument, attributes) -> boolean. It's basically as if we were adding one boolean to the PersistentTopic class - it has so many properties that the added size is negligible.
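>>>
>>> A sketch of that cache (FilterKey and rules here are placeholders for whatever the sub-PIP ends up defining):
>>>
>>>     import io.opentelemetry.api.common.Attributes;
>>>     import java.util.concurrent.ConcurrentHashMap;
>>>     import java.util.concurrent.ConcurrentMap;
>>>
>>>     // Regex evaluation runs once per (instrument, attributes) pair; afterwards it's a map lookup.
>>>     record FilterKey(String instrument, Attributes attributes) {}
>>>
>>>     private final ConcurrentMap<FilterKey, Boolean> filterCache = new ConcurrentHashMap<>();
>>>
>>>     boolean shouldExport(String instrument, Attributes attributes) {
>>>         return filterCache.computeIfAbsent(new FilterKey(instrument, attributes),
>>>                 key -> rules.matches(key.instrument(), key.attributes()));
>>>     }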
>>>
>>>> - The group and filter rules are a little complex for users, and we need to provide a UI or tool to help users write correct and effective rules and show the new rules' impact on old rules.
>>>
>>> We don't have a UI in Pulsar for now. We will make sure the Pulsar CLI is convenient enough.
>>>
>>>> Thanks,
>>>> Hang
>>>>
>>>> On Thu, Jun 15, 2023 at 23:14 Matteo Merli <matteo.me...@gmail.com> wrote:
>>>>>
>>>>>> Proposing a large breaking change (even if it's crucial) is the single fastest way to motivate your users to migrate to a different platform. I wish it wasn't the case, but it's the cold reality.
>>>>>
>>>>> If you read the proposal, there is no real breaking change. There will be a switch to choose the existing metrics or the new ones. The dashboards will be updated and provided.
>>>>>
>>>>> At the same time, the surest way to motivate users to switch away from, or not adopt, a platform is to stick with confusing/inconsistent APIs/metrics.
>>>>>
>>>>> --
>>>>> Matteo Merli
>>>>> <matteo.me...@gmail.com>
>>>>>
>>>>> On Wed, Jun 14, 2023 at 6:10 PM Devin Bost <devin.b...@gmail.com> wrote:
>>>>>
>>>>>>> Thanks for the details, Devin. Curious - 'we' stands for which company?
>>>>>>
>>>>>> What do you mean? I was quoting Rajan when I said, "we."
>>>>>>
>>>>>> Devin G. Bost
>>>>>>
>>>>>> On Wed, Jun 14, 2023 at 10:02 AM Asaf Mesika <asaf.mes...@gmail.com> wrote:
>>>>>>
>>>>>>> Thanks for the details, Devin. Curious - 'we' stands for which company?
>>>>>>>
>>>>>>> Can you take a look at my previous response to see if it answers the concern you raised?
>>>>>>>
>>>>>>> Thanks!
>>>>>>>
>>>>>>> On Wed, Jun 14, 2023 at 1:49 PM Devin Bost <devin.b...@gmail.com> wrote:
>>>>>>>
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> Are we proposing a change that breaks existing metrics compatibility (Prometheus)? If that is the case, then it's a big red flag, as it will be a pain for any company to upgrade Pulsar: monitoring is THE most important part of the system, and we don't want to break compatibility even for small things, to avoid interruption for users that are using the Pulsar system.
>>>>>>>>> I think it's always good to enhance a system by maintaining compatibility, and I would be fine if we can introduce a new metrics API without causing ANY interruption to the existing metrics API. But if we can't maintain compatibility, then it's a big red flag and not acceptable for the Pulsar community.
>>>>>>>>
>>>>>>>> Proposing a large breaking change (even if it's crucial) is the single fastest way to motivate your users to migrate to a different platform. I wish it wasn't the case, but it's the cold reality.
>>>>>>>>
>>>>>>>> With that said, I'm a big proponent of OpenTelemetry. I did a big video a while back, which some of you may remember, on the use of OpenTracing (before it was merged into OpenTelemetry). OpenTelemetry has gained considerable momentum in the industry since then.
>>>>>>>>
>>>>>>>> I'm also very interested in a solution to the metrics problem. I've run into the scalability issues with metrics in production, and I've been very concerned about the metrics bottlenecks limiting our ability to deliver on our promises around supporting large numbers of topics. One of the big advantages of Pulsar over Kafka is supposed to be that topics are cheap, but as it stands, our current metrics design gets seriously in the way of that. Generally speaking, I'm open to solutions, especially if they align us with a growing industry standard.
>>>>>>>>
>>>>>>>> - Devin
>>>>>>>>
>>>>>>>> On Wed, Jun 14, 2023, 3:28 AM Enrico Olivelli <eolive...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> On Wed, Jun 14, 2023 at 04:33 Rajan Dhabalia <rdhaba...@apache.org> wrote:
>>>>>>>>>
>>>>>>>>>> Hi,
>>>>>>>>>>
>>>>>>>>>> Are we proposing a change that breaks existing metrics compatibility (Prometheus)? If that is the case, then it's a big red flag, as it will be a pain for any company to upgrade Pulsar: monitoring is THE most important part of the system, and we don't want to break compatibility even for small things, to avoid interruption for users that are using the Pulsar system.
>>>>>>>>>> I think it's always good to enhance a system by maintaining compatibility, and I would be fine if we can introduce a new metrics API without causing ANY interruption to the existing metrics API. But if we can't maintain compatibility, then it's a big red flag and not acceptable for the Pulsar community.
>>>>>>>>>
>>>>>>>>> I agree.
>>>>>>>>>
>>>>>>>>> If it is possible to export data in a way that is compatible with Prometheus without adding too much overhead, then I would support this work.
>>>>>>>>>
>>>>>>>>> About renaming the metrics: we can do it only if the changes for users are as trivial as replacing the queries in the Grafana dashboards or in alerting systems.
>>>>>>>>>
>>>>>>>>> Asaf, do you have a prototype? Built over any version of Pulsar?
>>>>>>>>>
>>>>>>>>> Also, it would be very useful to start an initiative to collect the list of metrics that people really use in production, especially for automated alerts.
>>>>>>>>>
>>>>>>>>> In my experience you usually care about:
>>>>>>>>> - in/out traffic (rates, bytes...)
>>>>>>>>> - number of producers, consumers, topics, subscriptions...
>>>>>>>>> - backlog
>>>>>>>>> - JVM metrics
>>>>>>>>> - function custom metrics
>>>>>>>>>
>>>>>>>>> Enrico
>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> Rajan
>>>>>>>>>>
>>>>>>>>>> On Sun, May 21, 2023 at 9:01 AM Asaf Mesika <asaf.mes...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Thanks for the reply, Enrico.
>>>>>>>>>>> Completely agree.
>>>>>>>>>>> This made me realize my TL;DR wasn't talking about export. I added this to it:
>>>>>>>>>>>
>>>>>>>>>>> ---
>>>>>>>>>>> Pulsar OTel Metrics will support exporting as a Prometheus HTTP endpoint (`/metrics`, but on a different port) for backward compatibility, and also OTLP, so you can push the metrics to the OTel Collector and from there ship them to any destination.
>>>>>>>>>>> ---
>>>>>>>>>>>
>>>>>>>>>>> OTel supports two kinds of exporters: Prometheus (HTTP) and OTLP (push). We'll just configure it to use them.
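>>>>>>>>>>>
>>>>>>>>>>> With the Java SDK that's roughly the following (the port and endpoint values are examples):
>>>>>>>>>>>
>>>>>>>>>>>     import io.opentelemetry.exporter.otlp.metrics.OtlpGrpcMetricExporter;
>>>>>>>>>>>     import io.opentelemetry.exporter.prometheus.PrometheusHttpServer;
>>>>>>>>>>>     import io.opentelemetry.sdk.metrics.SdkMeterProvider;
>>>>>>>>>>>     import io.opentelemetry.sdk.metrics.export.PeriodicMetricReader;
>>>>>>>>>>>
>>>>>>>>>>>     SdkMeterProvider meterProvider = SdkMeterProvider.builder()
>>>>>>>>>>>             // Pull: a Prometheus-compatible endpoint on its own port.
>>>>>>>>>>>             .registerMetricReader(PrometheusHttpServer.builder().setPort(9464).build())
>>>>>>>>>>>             // Push: OTLP to an OpenTelemetry Collector.
>>>>>>>>>>>             .registerMetricReader(PeriodicMetricReader.builder(
>>>>>>>>>>>                             OtlpGrpcMetricExporter.builder()
>>>>>>>>>>>                                     .setEndpoint("http://otel-collector:4317")
>>>>>>>>>>>                                     .build())
>>>>>>>>>>>                     .build())
>>>>>>>>>>>             .build();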
>>>>>>>>>>>
>>>>>>>>>>> On Mon, May 15, 2023 at 10:35 AM Enrico Olivelli <eolive...@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Asaf,
>>>>>>>>>>>> thanks for contributing in this area.
>>>>>>>>>>>> Metrics are a fundamental feature of Pulsar.
>>>>>>>>>>>>
>>>>>>>>>>>> Currently I find it very awkward to maintain metrics, and I also see it as a problem to support only Prometheus.
>>>>>>>>>>>>
>>>>>>>>>>>> Regarding your proposal, IIRC in the past someone else proposed to support other metrics systems, and they were advised to use a sidecar approach, that is, to add something next to the Pulsar services that serves the metrics in the preferred format/way. I find the sidecar approach too inefficient, and I am not proposing it (but I wanted to add this reference for the benefit of new people on the list).
>>>>>>>>>>>>
>>>>>>>>>>>> I wonder if it would be possible to keep compatibility with the current Prometheus-based metrics. Pulsar has now reached a point at which it is widely used by many companies, including with big clusters. Telling people that they have to rework all the infrastructure related to metrics, because we don't support Prometheus anymore or because we radically changed the way we publish metrics, is a step that seems too hard from my point of view.
>>>>>>>>>>>>
>>>>>>>>>>>> Currently I believe that compatibility is more important than versatility, and if we want to introduce new (and far better) features we must take that into account.
>>>>>>>>>>>>
>>>>>>>>>>>> So my point is that I generally support the idea of opening the way to OpenTelemetry, but we must have a way to not force all of our users to throw away their alerting systems, dashboards and know-how in troubleshooting Pulsar problems in production and dev.
>>>>>>>>>>>>
>>>>>>>>>>>> Best regards
>>>>>>>>>>>> Enrico
>>>>>>>>>>>>
>>>>>>>>>>>> On Mon, May 15, 2023 at 02:17 Dave Fisher <wave4d...@comcast.net> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> On May 10, 2023, at 1:01 AM, Asaf Mesika <asaf.mes...@gmail.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Tue, May 9, 2023 at 11:29 PM Dave Fisher <w...@apache.org> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On May 8, 2023, at 2:49 AM, Asaf Mesika <asaf.mes...@gmail.com> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Your feedback made me realize I need to add a "TL;DR" section, which I just added.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I'm quoting it here. It gives a brief summary of the proposal, which requires up to 5 minutes of reading time, helping you get a high-level picture before you dive into the background/motivation/solution.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> ----------------------
>>>>>>>>>>>>>>>> TL;DR
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Working with metrics today as a user or a developer is hard and has many severe issues.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> From the user perspective:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> - One of Pulsar's strongest features is "cheap" topics, so you can easily have 10k - 100k topics per broker. Once you do that, you quickly learn that the amount of metrics you export via "/metrics" (the Prometheus-style endpoint) becomes really big. The cost to store them becomes too high, queries time out, or even the "/metrics" endpoint itself times out. The only option Pulsar gives you today is all-or-nothing filtering and very crude aggregation: you switch metrics from the topic aggregation level to the namespace aggregation level, and you can also turn off producer- and consumer-level metrics - today that means flipping broker.conf toggles, as sketched below.
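>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> For reference, the existing broker.conf switches (names from memory) are:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>     exposeTopicLevelMetricsInPrometheus=false
>>>>>>>>>>>>>>>>     exposeConsumerLevelMetricsInPrometheus=false
>>>>>>>>>>>>>>>>     exposeProducerLevelMetricsInPrometheus=false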
>>>>>>>>>>>>>>>> You end up doing all of it, leaving you "blind", looking at the metrics at the namespace level, which is too high-level. You end up conjuring all kinds of scripts on top of the topic stats endpoint to glue together some aggregated metrics view for the topics you need.
>>>>>>>>>>>>>>>> - Summaries (a metric type giving you quantiles like p95), which are used in Pulsar, can't be aggregated across topics/brokers, due to their inherent design.
>>>>>>>>>>>>>>>> - Plugin authors spend too much time on defining and exposing metrics to Pulsar, since the only interface Pulsar offers is writing your metrics yourself, as UTF-8 bytes in Prometheus text format, to a byte-stream interface given to you.
>>>>>>>>>>>>>>>> - Pulsar histograms are exported in a way that is not conformant with Prometheus, which means you can't get the p95 quantile on such histograms, making them very hard to use in day-to-day life.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> What version of DataSketches is used to produce the histogram? Is it still an old Yahoo one, or are we using an updated one from Apache DataSketches?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Seems like this is a single PR/small PIP for 3.1?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Histograms are a list of buckets, each of which is a counter.
>>>>>>>>>>>>>> A summary is a collection of values collected over a time window, from which at the end you get a calculation of the quantiles of those values: p95, p50 - and those are exported from Pulsar.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Pulsar histograms do not use DataSketches.
>>>>>>>>>>>>>
>>>>>>>>>>>>> BookKeeper metrics wraps Yahoo DataSketches, last I checked.
>>>>>>>>>>>>>
>>>>>>>>>>>>>> They are just counters.
>>>>>>>>>>>>>> They do not adhere to Prometheus since:
>>>>>>>>>>>>>> a. The counter is expected to be cumulative, but Pulsar resets each bucket counter to 0 every 1 minute.
>>>>>>>>>>>>>> b. The bucket upper range is expected to be written as an attribute "le", but today it is encoded in the name of the metric itself.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> This is a breaking change, hence hard to make in any small release.
>>>>>>>>>>>>>> This is why it's part of this PIP: so many things will break, and all of them will break on a separate layer (OTel metrics), hence not breaking anyone without their consent.
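>>>>>>>>>>>>>>
>>>>>>>>>>>>>> To make (b) concrete - today a write-latency bucket comes out with the bound baked into the metric name, roughly like:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>     pulsar_storage_write_latency_le_1{cluster="local"} 100
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> whereas a Prometheus-conformant histogram would export the bound as a label:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>     pulsar_storage_write_latency_bucket{cluster="local",le="1"} 100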
>>>>>>>>>>>>>
>>>>>>>>>>>>> If this change will break existing Grafana dashboards and other operational monitoring already in place, then it will break guarantees we have made about safely being able to downgrade from a bad upgrade.
>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> - Too many metrics are rates, which also delta-reset every interval you configure in Pulsar and on restart, instead of relying on cumulative (ever-growing) counters and letting Prometheus use its rate function.
>>>>>>>>>>>>>>>> - and many more issues
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> From the developer perspective:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> - There are 4 different ways to define and record metrics in Pulsar: Pulsar's own metrics library, the Prometheus Java client, the BookKeeper metrics library and plain native Java SDK objects (AtomicLong, ...). It's very confusing for the developer and creates inconsistencies for the end user (e.g. a Summary is different in each).
>>>>>>>>>>>>>>>> - Patching your metrics into the "/metrics" Prometheus endpoint is confusing, cumbersome and error-prone.
>>>>>>>>>>>>>>>> - many more
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> This proposal offers several key changes to solve that:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> - Cardinality (supporting 10k-100k topics per broker) is solved by introducing a new aggregation level for metrics called Topic Metric Group. Using configuration, you specify for each topic its group (using wildcard/regex). This allows you to "zoom out" to a granularity level that is more detailed than namespaces - groups - and since you control how many groups you'll have, this solves the cardinality issue without sacrificing the level of detail too much.
>>>>>>>>>>>>>>>> - A fine-grained, dynamic filtering mechanism. You'll have rule-based dynamic configuration, allowing you to specify per namespace/topic/group which metrics you'd like to keep/drop. Rules allow you to set the default to a small amount of metrics, at the group and namespace level only, and drop the rest. When needed, you can add an override rule to "open up" a certain group to more metrics at a higher granularity (topic or even consumer/producer level). Since it's dynamic, you "open" such a group when you see it's misbehaving, look at it at the topic level, and when all is resolved, you "close" it. A bit similar in experience to logging levels in Log4j or Logback, where you set a default and override per class/package. (Both mechanisms are illustrated in the sketch below.)
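>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Purely illustrative syntax - the real format is left to the sub-PIPs:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>     # Map topics to a group by regex, and default everything to group-level metrics only.
>>>>>>>>>>>>>>>>     metricGroups.orders.topics=persistent://public/default/orders-.*
>>>>>>>>>>>>>>>>     metricFilter.default.level=group
>>>>>>>>>>>>>>>>     # Temporary override while investigating: open the "orders" group up to topic granularity.
>>>>>>>>>>>>>>>>     metricFilter.orders.level=topic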
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Aggregation and filtering combined solve the cardinality issue without sacrificing the level of detail when needed, and most importantly, you determine which topic/group/namespace that happens on.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Since this change is so invasive, it requires a single metrics library to implement all of it on top of; hence the third big change is consolidating all four ways to define and record metrics into a single, new one: OpenTelemetry Metrics (the Java SDK, and also Python and Go for the Pulsar Function runners).
>>>>>>>>>>>>>>>> Introducing OpenTelemetry (OTel) also solves the biggest pain point from the developer perspective, since it's a superb metrics library offering everything you need, and there is going to be a single way - only it. It also solves the robustness problem for plugin authors, who will use OpenTelemetry. It so happens that it also solves the numerous other problems described in the doc itself.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> The solution will be introduced as another layer with feature toggles, so you can work with the existing system and/or OTel, until the existing system is gradually deprecated.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> It's a big breaking change for Pulsar users on many fronts: names, semantics, configuration. Read at the end of this doc to learn exactly what will change for the user (at a high level).
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> In my opinion, it will make the Pulsar user experience so much better that they will want to migrate to it, despite the breaking change.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> This was a very short summary. You are most welcome to read the full design document below and express feedback, so we can make it better.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Sun, May 7, 2023 at 7:52 PM Asaf Mesika <asaf.mes...@gmail.com> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Sun, May 7, 2023 at 4:23 PM Yunze Xu <y...@streamnative.io.invalid> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> I was excited to learn much more about metrics when I started reading this proposal. But I became more and more frustrated when I found there was still too much content left even after I had already spent much time reading this proposal.
>>>>>>>>>>>>>>>>>> I'm wondering how much time you expected reviewers to spend reading through this proposal. I just recalled the discussion you started before [1]. Did you expect each PMC member who gives his/her +1 to read only parts of this proposal?
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I estimated around 2 hours needed for a reviewer.
>>>>>>>>>>>>>>>>> I hate it being so long, but I simply couldn't find a way to downsize it more. Furthermore, I consulted with my colleagues, including Matteo, but we couldn't see a way to scope it down.
>>>>>>>>>>>>>>>>> Why? Because once you begin this journey, you need to know how it's going to end.
>>>>>>>>>>>>>>>>> What I ended up doing is writing all the crucial details for review in the High Level Design section.
>>>>>>>>>>>>>>>>> It's still a big, hefty section, but I don't think I can step back, or let anyone else change Pulsar so invasively, without the full extent of the change.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I don't think it's wise to read only parts.
>>>>>>>>>>>>>>>>> I made my very best effort to minimize it, but the scope is simply big.
>>>>>>>>>>>>>>>>> Open for suggestions, but it requires reading all of the PIP :)
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Thanks a lot, Yunze, for dedicating any time to it.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Let's talk back to the proposal. For now, what I mainly learned and am mostly concerned about is:
>>>>>>>>>>>>>>>>>> 1. Pulsar has many ways to expose metrics. It's not unified, and it's confusing.
>>>>>>>>>>>>>>>>>> 2. The current metrics system cannot support a large number of topics.
>>>>>>>>>>>>>>>>>> 3. It's hard for plugin authors to integrate metrics. (For example, KoP [2] integrates metrics by implementing the PrometheusRawMetricsProvider interface, and it indeed needs much work.)
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Regarding the 1st issue, this proposal chooses OpenTelemetry (OTel).
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Regarding the 2nd issue, I scrolled to the "Why OpenTelemetry?" section. It's still frustrating to see no answer. Eventually, I found
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> OpenTelemetry isn't the solution for a large number of topics.
>>>>>>>>>>>>>>>>> The solution is described in the "Aggregate and Filtering to solve cardinality issues" section.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> the explanation in the "What we need to fix in OpenTelemetry - Performance" section. It seems that we still need some enhancements in OTel.
>>>>>>>>>>>>>>>>>> In other words, currently OTel is not ready for resolving all the issues listed in the proposal, but we believe it will be.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Let me rephrase "believe" --> we will work together with the maintainers to do it, yes.
>>>>>>>>>>>>>>>>> I am open to any other suggestion.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> As for the 3rd issue, from the "Integrating with Pulsar Plugins" section, plugin authors still need to implement the new OTel interfaces. Is it much easier than using the existing ways to expose metrics? Could metrics still be easily integrated with Grafana?
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Yes, it's way easier.
>>>>>>>>>>>>>>>>> Basically, you get the objects of a full-fledged metrics library: Meter, Gauge, Histogram, Counter.
>>>>>>>>>>>>>>>>> No more raw metrics provider, writing UTF-8 bytes in Prometheus format.
>>>>>>>>>>>>>>>>> You get namespacing for free with the Meter name and version.
>>>>>>>>>>>>>>>>> It's way better than the current solution and any other library.
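>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> A plugin would do roughly this (the meter/instrument names and request.sizeInBytes() are made up for illustration):
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>     import io.opentelemetry.api.common.AttributeKey;
>>>>>>>>>>>>>>>>>     import io.opentelemetry.api.common.Attributes;
>>>>>>>>>>>>>>>>>     import io.opentelemetry.api.metrics.LongCounter;
>>>>>>>>>>>>>>>>>     import io.opentelemetry.api.metrics.Meter;
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>     Meter meter = openTelemetry.meterBuilder("io.streamnative.kop")
>>>>>>>>>>>>>>>>>             .setInstrumentationVersion("1.0.0")
>>>>>>>>>>>>>>>>>             .build();
>>>>>>>>>>>>>>>>>     LongCounter bytesIn = meter.counterBuilder("kop.bytes.in")
>>>>>>>>>>>>>>>>>             .setUnit("By")
>>>>>>>>>>>>>>>>>             .setDescription("Bytes received over the Kafka protocol")
>>>>>>>>>>>>>>>>>             .build();
>>>>>>>>>>>>>>>>>     // Record on the request path; the topic becomes a regular attribute.
>>>>>>>>>>>>>>>>>     bytesIn.add(request.sizeInBytes(), Attributes.of(AttributeKey.stringKey("topic"), topic));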
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> That's all I am concerned about at the moment. I understand, and appreciate, that you've spent much time studying and explaining all these things. But this proposal is still too huge.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I appreciate your effort a lot!
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> [1] https://lists.apache.org/thread/04jxqskcwwzdyfghkv4zstxxmzn154kf
>>>>>>>>>>>>>>>>>> [2] https://github.com/streamnative/kop/blob/master/kafka-impl/src/main/java/io/streamnative/pulsar/handlers/kop/stats/PrometheusMetricsProvider.java
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>> Yunze
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On Sun, May 7, 2023 at 5:53 PM Asaf Mesika <asaf.mes...@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> I'm very appreciative of feedback from multiple Pulsar users and devs on this PIP, since it suggests dramatic changes and a quite extensive positive change for the users.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> On Thu, Apr 27, 2023 at 7:32 PM Asaf Mesika <asaf.mes...@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Hi all,
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> I'm very excited to release a PIP I've been working on for the past 11 months, which I think will be immensely valuable to Pulsar, which I like so much.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> PIP: https://github.com/apache/pulsar/issues/20197
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> I'm quoting here the preface:
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> === QUOTE START ===
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Roughly 11 months ago, I started working on solving the biggest issue with Pulsar metrics: the lack of ability to monitor a Pulsar broker with a large topic count: 10k, 100k, and future support of 1M. This started by mapping the existing functionality and then enumerating all the problems I saw (all documented in this doc <https://docs.google.com/document/d/1vke4w1nt7EEgOvEerPEUS-Al3aqLTm9cl2wTBkKNXUA/edit?usp=sharing>
>>>>>>>>>>>>>
>>>>>>>>>>>>> I thought we were going to stop using Google docs for PIPs.
>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> ).
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> This PIP is a parent PIP. It aims to gradually solve (using sub-PIPs) all the current metric system's problems and provide the ability to monitor a broker with a large topic count, which is currently lacking. As a parent PIP, it will describe each problem and its solution at a high level, leaving fine-grained details to the sub-PIPs. The parent PIP ensures all solutions align and do not contradict each other.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> The basic building block to solve the monitoring of a large topic count is aggregating internally (into topic groups) and adding fine-grained filtering. We could have shoe-horned it into the existing metric system, but we thought adding that to a system already ingrained with many problems would be wrong and hard to do gradually, as so many things will break. This is why the second-biggest design decision presented here is consolidating all existing metric libraries into a single one - OpenTelemetry <https://opentelemetry.io/>. The parent PIP will explain why OpenTelemetry was chosen out of the existing solutions and why it far exceeds all other options.
I’ve been working closely >>>> with the >>>>>>>>>>>> OpenTelemetry >>>>>>>>>>>>>>>>>>>> community in the past eight months: >>>> brain-storming >>>>>> this >>>>>>>>>>>> integration, >>>>>>>>>>>>>>>>>> and >>>>>>>>>>>>>>>>>>>> raising issues, in an effort to remove serious >>>>>> blockers >>>>>>>> to >>>>>>>>>> make >>>>>>>>>>>> this >>>>>>>>>>>>>>>>>>>> migration successful. >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> I made every effort to summarize this document >>>> so >>>>>> that >>>>>>> it >>>>>>>>> can >>>>>>>>>>> be >>>>>>>>>>>>>>>>>> concise >>>>>>>>>>>>>>>>>>>> yet clear. I understand it is an effort to read >>>> it >>>>>> and, >>>>>>>>> more >>>>>>>>>>> so, >>>>>>>>>>>>>>>>>> provide >>>>>>>>>>>>>>>>>>>> meaningful feedback on such a large document; >>>> hence >>>>>> I’m >>>>>>>>> very >>>>>>>>>>>> grateful >>>>>>>>>>>>>>>>>> for >>>>>>>>>>>>>>>>>>>> each individual who does so. >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> I think this design will help improve the user >>>>>>> experience >>>>>>>>>>>> immensely, >>>>>>>>>>>>>>>>>> so it >>>>>>>>>>>>>>>>>>>> is worth the time spent reading it. >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> === QUOTE END === >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> Thanks! >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> Asaf Mesika >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>> >>>>>>>>> >>>>>>>> >>>>>>> >>>>>> >>>> >>>