+1

On Sun, Mar 3, 2024 at 16:58, Enrico Olivelli <eolive...@gmail.com> wrote:
> I support the initiative
>
> LGTM
>
> Thanks
> Enrico
>
> On Sun, 3 Mar 2024, 04:09 Matteo Merli <matteo.me...@gmail.com> wrote:
>
> > PIP PR: https://github.com/apache/pulsar/pull/22178
> >
> > WIP of proposed implementation:
> > https://github.com/apache/pulsar/pull/22179
> >
> > --------------------
> >
> > # PIP 342: Support OpenTelemetry metrics in Pulsar client
> >
> > ## Motivation
> >
> > Current support for metric instrumentation in the Pulsar client is very limited
> > and poses a lot of issues for integrating the metrics into any telemetry system.
> >
> > Metrics are exposed in 2 ways today:
> >
> > 1. Printing logs every 1 minute: while this is OK as it comes out of the box, it is very hard for
> >    any application to get the data or use it in any meaningful way.
> > 2. `producer.getStats()` or `consumer.getStats()`: calling these methods gives access to
> >    the rate of events in the last 1-minute interval. This is problematic because, out of the
> >    box, the metrics are not collected anywhere. One would have to start a dedicated thread to
> >    periodically check these values and export them to some other system.
> >
> > Neither of the mechanisms that we have today is sufficient to enable applications to easily
> > export the telemetry data of the Pulsar client SDK.
> >
> > ## Goal
> >
> > Provide a good way for applications to retrieve and analyze the usage of Pulsar client operations,
> > in particular with respect to:
> >
> > 1. Maximizing compatibility with existing telemetry systems
> > 2. Minimizing the effort required to export these metrics
> >
> > ## Why OpenTelemetry?
> >
> > [OpenTelemetry](https://opentelemetry.io/) is quickly becoming the de facto standard API for metric and
> > tracing instrumentation. In fact, as part of [PIP-264](https://github.com/apache/pulsar/blob/master/pip/pip-264.md),
> > we are already migrating the Pulsar server-side metrics to use OpenTelemetry.
> >
> > For the Pulsar client SDK, we need to provide a similar way for application builders to quickly integrate and
> > export Pulsar metrics.
> >
> > ### Why expose OpenTelemetry directly in the Pulsar API
> >
> > When deciding how to expose the metrics exporter configuration, there are multiple options:
> >
> > 1. Accept an `OpenTelemetry` object directly in the Pulsar API
> > 2. Build a pluggable interface that describes all the Pulsar client SDK events and allows applications to
> >    provide an implementation, perhaps with an OpenTelemetry implementation included as an option.
> >
> > For this proposal, we are following option (1). Here are the reasons:
> >
> > 1. In a way, OpenTelemetry can be compared to [SLF4J](https://www.slf4j.org/), in the sense that it provides an API
> >    on top of which different vendors can build multiple implementations. Therefore, there is no need to create a new
> >    Pulsar-specific interface.
> > 2. OpenTelemetry has 2 main artifacts: API and SDK. In the context of the Pulsar client, we will only depend on its
> >    API. Applications that are going to use OpenTelemetry will include the OTel SDK.
> > 3. Providing a custom interface has several drawbacks:
> >    1. Applications need to update their implementations every time a new metric is added to the Pulsar SDK
> >    2. The surface of this plugin API can become quite big when there are several metrics
> >    3. If we imagine an application that uses multiple libraries, like the Pulsar SDK, and each of these has its own
> >       custom way to expose metrics, we can see the level of integration burden that is pushed to application
> >       developers
> > 4. It will always be easy to use OpenTelemetry to collect the metrics and export them using a custom metrics API. There
> >    are several examples of this in the OpenTelemetry documentation.
> >
> > ## Public API changes
> >
> > ### Enabling OpenTelemetry
> >
> > When building a `PulsarClient` instance, it will be possible to pass an `OpenTelemetry` object:
> >
> > ```java
> > interface ClientBuilder {
> >     // ...
> >     ClientBuilder openTelemetry(io.opentelemetry.api.OpenTelemetry openTelemetry);
> >
> >     ClientBuilder openTelemetryMetricsCardinality(MetricsCardinality metricsCardinality);
> > }
> > ```
> >
> > The common usage for an application would be something like:
> >
> > ```java
> > import io.opentelemetry.api.OpenTelemetry;
> > import io.opentelemetry.sdk.autoconfigure.AutoConfiguredOpenTelemetrySdk;
> > import org.apache.pulsar.client.api.PulsarClient;
> >
> > // Creates an OpenTelemetry instance using environment variables to configure it
> > OpenTelemetry otel = AutoConfiguredOpenTelemetrySdk.builder()
> >         .build().getOpenTelemetrySdk();
> >
> > PulsarClient client = PulsarClient.builder()
> >         .serviceUrl("pulsar://localhost:6650")
> >         // Pass the OpenTelemetry instance to the client
> >         .openTelemetry(otel)
> >         .build();
> >
> > // ....
> > ```
> >
> > The cardinality enum selects the default set of cardinality labels attached to the
> > metrics:
> >
> > ```java
> > public enum MetricsCardinality {
> >     /**
> >      * Do not add additional labels to metrics
> >      */
> >     None,
> >
> >     /**
> >      * Label metrics by tenant
> >      */
> >     Tenant,
> >
> >     /**
> >      * Label metrics by tenant and namespace
> >      */
> >     Namespace,
> >
> >     /**
> >      * Label metrics by topic
> >      */
> >     Topic,
> >
> >     /**
> >      * Label metrics by each partition
> >      */
> >     Partition,
> > }
> > ```
> >
> > The labels are additive. For example, selecting the `Topic` level means that the metrics will be
> > labeled like:
> >
> > ```
> > pulsar_client_received_total{namespace="public/default",tenant="public",topic="persistent://public/default/pt"} 149.0
> > ```
> >
> > While selecting the `Namespace` level:
> >
> > ```
> > pulsar_client_received_total{namespace="public/default",tenant="public"} 149.0
> > ```
> >
> > ### Deprecating the old stats methods
> >
> > The old way of collecting stats will be disabled by default, deprecated, and eventually removed
> > in the Pulsar 4.0 release.
> >
> > Methods to deprecate:
> >
> > ```java
> > interface ClientBuilder {
> >     // ...
> >     @Deprecated
> >     ClientBuilder statsInterval(long statsInterval, TimeUnit unit);
> > }
> >
> > interface Producer {
> >     @Deprecated
> >     ProducerStats getStats();
> > }
> >
> > interface Consumer {
> >     @Deprecated
> >     ConsumerStats getStats();
> > }
> > ```
> >
> > ## Initial set of metrics to include
> >
> > Based on the experience of the Pulsar Go client SDK metrics (see:
> > https://github.com/apache/pulsar-client-go/blob/master/pulsar/internal/metrics.go),
> > this is the proposed initial set of metrics to export.
> >
> > Additional metrics could be added later on, though it's better to start with the set of most important metrics
> > and then evaluate any missing information.
> >
> > | OTel metric name                                | Type      | Unit        | Description |
> > |-------------------------------------------------|-----------|-------------|-------------|
> > | `pulsar.client.connections.opened`              | Counter   | connections | Counter of connections opened |
> > | `pulsar.client.connections.closed`              | Counter   | connections | Counter of connections closed |
> > | `pulsar.client.connections.failed`              | Counter   | connections | Counter of connection establishment failures |
> > | `pulsar.client.session.opened`                  | Counter   | sessions    | Counter of sessions opened. `type="producer"` or `consumer` |
> > | `pulsar.client.session.closed`                  | Counter   | sessions    | Counter of sessions closed. `type="producer"` or `consumer` |
> > | `pulsar.client.received`                        | Counter   | messages    | Number of messages received |
> > | `pulsar.client.received`                        | Counter   | bytes       | Number of bytes received |
> > | `pulsar.client.consumer.preteched.messages`     | Gauge     | messages    | Number of messages currently sitting in the consumer pre-fetch queue |
> > | `pulsar.client.consumer.preteched`              | Gauge     | bytes       | Total number of bytes currently sitting in the consumer pre-fetch queue |
> > | `pulsar.client.consumer.ack`                    | Counter   | messages    | Number of ack operations |
> > | `pulsar.client.consumer.nack`                   | Counter   | messages    | Number of negative ack operations |
> > | `pulsar.client.consumer.dlq`                    | Counter   | messages    | Number of messages sent to DLQ |
> > | `pulsar.client.consumer.ack.timeout`            | Counter   | messages    | Number of ack timeout events |
> > | `pulsar.client.producer.latency`                | Histogram | seconds     | Publish latency experienced by the application, includes client batching time |
> > | `pulsar.client.producer.rpc.latency`            | Histogram | seconds     | Publish RPC latency experienced internally by the client, from sending data to receiving an ack |
> > | `pulsar.client.producer.published`              | Counter   | bytes       | Bytes published |
> > | `pulsar.client.producer.pending.messages.count` | Gauge     | messages    | Pending messages for this producer |
> > | `pulsar.client.producer.pending.count`          | Gauge     | bytes       | Pending bytes for this producer |
> >
> > The topic lookup metric will be differentiated by the lookup `type` label and by the lookup transport
> > mechanism (`transport-type="binary|http"`):
> >
> > | OTel metric name                           | Type      | Unit    | Description |
> > |--------------------------------------------|-----------|---------|-------------|
> > | `pulsar.client.lookup{type="topic"}`       | Histogram | seconds | Duration of topic lookup operations |
> > | `pulsar.client.lookup{type="metadata"}`    | Histogram | seconds | Duration of partitioned topic metadata operations |
> > | `pulsar.client.lookup{type="schema"}`      | Histogram | seconds | Duration of schema retrieval operations |
> > | `pulsar.client.lookup{type="list-topics"}` | Histogram | seconds | Duration of namespace list topics operations |
> >
> > Additionally, all the histograms will have a `success=true|false` label to distinguish successful and failed
> > operations.
> >
> > --
> > Matteo Merli
> > <matteo.me...@gmail.com>
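
To illustrate the additive cardinality labels described in the proposal, here is a minimal, stdlib-only Java sketch. It is not the proposed implementation: `CardinalityLabels`, its nested `Cardinality` enum (a local stand-in for the proposed `MetricsCardinality`), and the `partition` label name are hypothetical. It also assumes topic names of the form `domain://tenant/namespace/name` and partitions named `<topic>-partition-<N>`.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper, not part of the Pulsar API: derives additive metric labels from a topic name. */
final class CardinalityLabels {

    /** Local stand-in for the proposed MetricsCardinality enum, ordered from coarsest to finest. */
    enum Cardinality { None, Tenant, Namespace, Topic, Partition }

    /** Returns the labels for the given level; each level also includes the labels of every coarser level. */
    static Map<String, String> labelsFor(String topic, Cardinality level) {
        Map<String, String> labels = new LinkedHashMap<>();
        if (level == Cardinality.None) {
            return labels;
        }
        // Assumption: topic names look like "persistent://tenant/namespace/name".
        int schemeEnd = topic.indexOf("://");
        String[] parts = schemeEnd < 0 ? new String[0] : topic.substring(schemeEnd + 3).split("/", 3);
        if (parts.length != 3) {
            throw new IllegalArgumentException("Expected domain://tenant/namespace/name, got: " + topic);
        }
        labels.put("tenant", parts[0]);
        if (level.compareTo(Cardinality.Namespace) >= 0) {
            labels.put("namespace", parts[0] + "/" + parts[1]);
        }
        if (level.compareTo(Cardinality.Topic) >= 0) {
            // Assumption: a partition is named "<topic>-partition-<N>"; the topic label drops that suffix.
            String suffix = "-partition-";
            int idx = topic.lastIndexOf(suffix);
            labels.put("topic", idx < 0 ? topic : topic.substring(0, idx));
            if (level == Cardinality.Partition && idx >= 0) {
                // The "partition" label name is an assumption made for this sketch.
                labels.put("partition", topic.substring(idx + suffix.length()));
            }
        }
        return labels;
    }

    public static void main(String[] args) {
        String topic = "persistent://public/default/pt";
        // prints {tenant=public, namespace=public/default}
        System.out.println(labelsFor(topic, Cardinality.Namespace));
        // prints {tenant=public, namespace=public/default, topic=persistent://public/default/pt}
        System.out.println(labelsFor(topic, Cardinality.Topic));
    }
}
```

The two printed maps correspond to the `Namespace` and `Topic` label sets shown in the proposal's examples. Selecting a finer level only ever adds labels, which is why the choice of level directly controls the number of time series an application exports.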