I have updated the proposal (and the WIP PR) based on the very helpful feedback received. I will start a VOTE thread for this PIP.
Thanks.

--
Matteo Merli
<matteo.me...@gmail.com>

On Sun, Mar 3, 2024 at 5:14 AM 太上玄元道君 <dao...@apache.org> wrote:

> +1
>
> On Sun, Mar 3, 2024 at 16:58, Enrico Olivelli <eolive...@gmail.com> wrote:
>
> > I support the initiative
> >
> > Lgtm
> >
> > Thanks
> > Enrico
> >
> > On Sun, Mar 3, 2024, 04:09 Matteo Merli <matteo.me...@gmail.com> wrote:
> >
> > > PIP PR: https://github.com/apache/pulsar/pull/22178
> > >
> > > WIP of proposed implementation:
> > > https://github.com/apache/pulsar/pull/22179
> > >
> > > --------------------
> > >
> > > # PIP 342: Support OpenTelemetry metrics in Pulsar client
> > >
> > > ## Motivation
> > >
> > > Current support for metric instrumentation in the Pulsar client is very limited and makes it
> > > hard to integrate the metrics into any telemetry system.
> > >
> > > Metrics are exposed today in 2 ways:
> > >
> > > 1. Printing logs every 1 minute: while this works out of the box, it is very hard for
> > >    any application to get the data or use it in any meaningful way.
> > > 2. `producer.getStats()` or `consumer.getStats()`: calling these methods gives access to
> > >    the rate of events in the last 1-minute interval. This is problematic because, out of the
> > >    box, the metrics are not collected anywhere. One would have to start a dedicated thread to
> > >    periodically check these values and export them to some other system.
> > >
> > > Neither of the mechanisms we have today is sufficient to let applications easily
> > > export the telemetry data of the Pulsar client SDK.
> > >
> > > ## Goal
> > >
> > > Provide a good way for applications to retrieve and analyze the usage of Pulsar client operations,
> > > in particular with respect to:
> > >
> > > 1. Maximizing compatibility with existing telemetry systems
> > > 2. Minimizing the effort required to export these metrics
> > >
> > > ## Why OpenTelemetry?
> > >
> > > [OpenTelemetry](https://opentelemetry.io/) is quickly becoming the de-facto standard API for metric and
> > > tracing instrumentation. In fact, as part of [PIP-264](https://github.com/apache/pulsar/blob/master/pip/pip-264.md),
> > > we are already migrating the Pulsar server-side metrics to use OpenTelemetry.
> > >
> > > For the Pulsar client SDK, we need to provide a similar way for application builders to quickly integrate and
> > > export Pulsar metrics.
> > >
> > > ### Why expose OpenTelemetry directly in the Pulsar API
> > >
> > > When deciding how to expose the metrics exporter configuration, there are multiple options:
> > >
> > > 1. Accept an `OpenTelemetry` object directly in the Pulsar API
> > > 2. Build a pluggable interface that describes all the Pulsar client SDK events and allows applications to
> > >    provide an implementation, perhaps shipping an OpenTelemetry implementation as an included option.
> > >
> > > For this proposal, we are following option (1). Here are the reasons:
> > >
> > > 1. In a way, OpenTelemetry can be compared to [SLF4J](https://www.slf4j.org/), in the sense that it provides an API
> > >    on top of which different vendors can build multiple implementations. Therefore, there is no need to create a new
> > >    Pulsar-specific interface.
> > > 2. OpenTelemetry has 2 main artifacts: API and SDK. In the context of the Pulsar client, we will only depend on its
> > >    API. Applications that are going to use OpenTelemetry will include the OTel SDK.
> > > 3. Providing a custom interface has several drawbacks:
> > >    1. Applications need to update their implementations every time a new metric is added in the Pulsar SDK
> > >    2. The surface of this plugin API can become quite big when there are several metrics
> > >    3. If we imagine an application that uses multiple libraries, like the Pulsar SDK, and each of these has its own
> > >       custom way to expose metrics, we can see the level of integration burden that is pushed onto application
> > >       developers
> > > 4. It will always be easy to use OpenTelemetry to collect the metrics and export them using a custom metrics API. There
> > >    are several examples of this in the OpenTelemetry documentation.
> > >
> > > ## Public API changes
> > >
> > > ### Enabling OpenTelemetry
> > >
> > > When building a `PulsarClient` instance, it will be possible to pass an `OpenTelemetry` object:
> > >
> > > ```java
> > > interface ClientBuilder {
> > >     // ...
> > >     ClientBuilder openTelemetry(io.opentelemetry.api.OpenTelemetry openTelemetry);
> > >
> > >     ClientBuilder openTelemetryMetricsCardinality(MetricsCardinality metricsCardinality);
> > > }
> > > ```
> > >
> > > The common usage for an application would be something like:
> > >
> > > ```java
> > > import io.opentelemetry.api.OpenTelemetry;
> > > import io.opentelemetry.sdk.autoconfigure.AutoConfiguredOpenTelemetrySdk;
> > > import org.apache.pulsar.client.api.PulsarClient;
> > >
> > > // Creates an OpenTelemetry instance using environment variables to configure it
> > > OpenTelemetry otel = AutoConfiguredOpenTelemetrySdk.builder()
> > >         .build().getOpenTelemetrySdk();
> > >
> > > // Pass the OpenTelemetry instance to the client builder
> > > PulsarClient client = PulsarClient.builder()
> > >         .serviceUrl("pulsar://localhost:6650")
> > >         .openTelemetry(otel)
> > >         .build();
> > >
> > > // ....
> > > ```
> > >
> > > The cardinality enum selects the default cardinality level of the labels attached to the
> > > metrics:
> > >
> > > ```java
> > > public enum MetricsCardinality {
> > >     /**
> > >      * Do not add additional labels to metrics
> > >      */
> > >     None,
> > >
> > >     /**
> > >      * Label metrics by tenant
> > >      */
> > >     Tenant,
> > >
> > >     /**
> > >      * Label metrics by tenant and namespace
> > >      */
> > >     Namespace,
> > >
> > >     /**
> > >      * Label metrics by topic
> > >      */
> > >     Topic,
> > >
> > >     /**
> > >      * Label metrics by each partition
> > >      */
> > >     Partition,
> > > }
> > > ```
> > >
> > > The labels are additive: each level also includes the labels of the levels above it. For example,
> > > selecting the `Topic` level means that the metrics will be labeled like:
> > >
> > > ```
> > > pulsar_client_received_total{namespace="public/default",tenant="public",topic="persistent://public/default/pt"} 149.0
> > > ```
> > >
> > > While selecting the `Namespace` level:
> > >
> > > ```
> > > pulsar_client_received_total{namespace="public/default",tenant="public"} 149.0
> > > ```
> > >
> > > ### Deprecating the old stats methods
> > >
> > > The old way of collecting stats will be disabled by default, deprecated, and eventually removed
> > > in the Pulsar 4.0 release.
> > >
> > > Methods to deprecate:
> > >
> > > ```java
> > > interface ClientBuilder {
> > >     // ...
> > >     @Deprecated
> > >     ClientBuilder statsInterval(long statsInterval, TimeUnit unit);
> > > }
> > >
> > > interface Producer {
> > >     @Deprecated
> > >     ProducerStats getStats();
> > > }
> > >
> > > interface Consumer {
> > >     @Deprecated
> > >     ConsumerStats getStats();
> > > }
> > > ```
> > >
> > > ## Initial set of metrics to include
> > >
> > > Based on the experience of the Pulsar Go client SDK metrics (see:
> > > https://github.com/apache/pulsar-client-go/blob/master/pulsar/internal/metrics.go),
> > > this is the proposed initial set of metrics to export.
> > >
> > > Additional metrics could be added later on, though it's better to start with the set of most important metrics
> > > and then evaluate any missing information.
> > >
> > > | OTel metric name                                | Type      | Unit        | Description                                                                                     |
> > > |-------------------------------------------------|-----------|-------------|-------------------------------------------------------------------------------------------------|
> > > | `pulsar.client.connections.opened`              | Counter   | connections | Counter of connections opened                                                                   |
> > > | `pulsar.client.connections.closed`              | Counter   | connections | Counter of connections closed                                                                   |
> > > | `pulsar.client.connections.failed`              | Counter   | connections | Counter of connection establishment failures                                                    |
> > > | `pulsar.client.session.opened`                  | Counter   | sessions    | Counter of sessions opened. `type="producer"` or `consumer`                                     |
> > > | `pulsar.client.session.closed`                  | Counter   | sessions    | Counter of sessions closed. `type="producer"` or `consumer`                                     |
> > > | `pulsar.client.received`                        | Counter   | messages    | Number of messages received                                                                     |
> > > | `pulsar.client.received`                        | Counter   | bytes       | Number of bytes received                                                                        |
> > > | `pulsar.client.consumer.preteched.messages`     | Gauge     | messages    | Number of messages currently sitting in the consumer pre-fetch queue                            |
> > > | `pulsar.client.consumer.preteched`              | Gauge     | bytes       | Total number of bytes currently sitting in the consumer pre-fetch queue                         |
> > > | `pulsar.client.consumer.ack`                    | Counter   | messages    | Number of ack operations                                                                        |
> > > | `pulsar.client.consumer.nack`                   | Counter   | messages    | Number of negative ack operations                                                               |
> > > | `pulsar.client.consumer.dlq`                    | Counter   | messages    | Number of messages sent to DLQ                                                                  |
> > > | `pulsar.client.consumer.ack.timeout`            | Counter   | messages    | Number of ack timeout events                                                                    |
> > > | `pulsar.client.producer.latency`                | Histogram | seconds     | Publish latency experienced by the application, includes client batching time                  |
> > > | `pulsar.client.producer.rpc.latency`            | Histogram | seconds     | Publish RPC latency experienced internally by the client, from sending data to receiving an ack |
> > > | `pulsar.client.producer.published`              | Counter   | bytes       | Bytes published                                                                                 |
> > > | `pulsar.client.producer.pending.messages.count` | Gauge     | messages    | Pending messages for this producer                                                              |
> > > | `pulsar.client.producer.pending.count`          | Gauge     | bytes       | Pending bytes for this producer                                                                 |
> > >
> > > Topic lookup metrics will be differentiated by the lookup type label and by the lookup transport
> > > mechanism (`transport-type="binary|http"`):
> > >
> > > | OTel metric name                           | Type      | Unit    | Description                                      |
> > > |--------------------------------------------|-----------|---------|--------------------------------------------------|
> > > | `pulsar.client.lookup{type="topic"}`       | Histogram | seconds | Counter of topic lookup operations               |
> > > | `pulsar.client.lookup{type="metadata"}`    | Histogram | seconds | Counter of topic partitioned metadata operations |
> > > | `pulsar.client.lookup{type="schema"}`      | Histogram | seconds | Counter of schema retrieval operations           |
> > > | `pulsar.client.lookup{type="list-topics"}` | Histogram | seconds | Counter of namespace list topics operations      |
> > >
> > > Additionally, all the histograms will have a `success=true|false` label to distinguish successful and failed
> > > operations.
> > >
> > > --
> > > Matteo Merli
> > > <matteo.me...@gmail.com>
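
As a rough illustration of the additive cardinality labels described in the proposal, here is a minimal, JDK-only sketch of how a topic name could be mapped to a label set for each level. The class `CardinalityLabelsSketch`, the method `labelsFor`, the `partition` label name, and the choice to report the base topic name for partitions are all assumptions made for illustration; they are not taken from the PIP or from the WIP PR. It also assumes topic names of the form `domain://tenant/namespace/name` with an optional `-partition-N` suffix.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of the Pulsar client API.
final class CardinalityLabelsSketch {

    // Mirrors the levels of the MetricsCardinality enum in the proposal.
    enum Cardinality { None, Tenant, Namespace, Topic, Partition }

    private static final String PARTITION_SUFFIX = "-partition-";

    // Builds the label set for a topic; each level adds to the labels of the previous one.
    static Map<String, String> labelsFor(String topic, Cardinality level) {
        Map<String, String> labels = new LinkedHashMap<>();
        if (level == Cardinality.None) {
            return labels;
        }

        // Assumed format: domain://tenant/namespace/local-name
        int schemeEnd = topic.indexOf("://");
        if (schemeEnd < 0) {
            throw new IllegalArgumentException("Not a fully qualified topic name: " + topic);
        }
        String[] parts = topic.substring(schemeEnd + 3).split("/", 3);
        if (parts.length < 3) {
            throw new IllegalArgumentException("Not a fully qualified topic name: " + topic);
        }

        labels.put("tenant", parts[0]);
        if (level.compareTo(Cardinality.Namespace) >= 0) {
            labels.put("namespace", parts[0] + "/" + parts[1]);
        }
        if (level.compareTo(Cardinality.Topic) >= 0) {
            // Split off a trailing "-partition-N" suffix, if present and numeric.
            int idx = topic.lastIndexOf(PARTITION_SUFFIX);
            String suffix = idx >= 0 ? topic.substring(idx + PARTITION_SUFFIX.length()) : "";
            boolean partitioned = !suffix.isEmpty() && suffix.chars().allMatch(Character::isDigit);

            labels.put("topic", partitioned ? topic.substring(0, idx) : topic);
            if (level == Cardinality.Partition && partitioned) {
                // "partition" is an assumed label name.
                labels.put("partition", suffix);
            }
        }
        return labels;
    }

    public static void main(String[] args) {
        String topic = "persistent://public/default/pt";
        System.out.println(labelsFor(topic, Cardinality.Namespace));
        System.out.println(labelsFor(topic, Cardinality.Topic));
    }
}
```

With the `Topic` level this yields the same three labels (`tenant`, `namespace`, `topic`) as the `pulsar_client_received_total` example in the proposal, and with `Namespace` only the first two.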