I support the initiative

Lgtm


Thanks
Enrico

Enrico

Il Dom 3 Mar 2024, 04:09 Matteo Merli <matteo.me...@gmail.com> ha scritto:

> PIP PR: https://github.com/apache/pulsar/pull/22178
>
> WIP of proposed implementation:
> https://github.com/apache/pulsar/pull/22179
>
> --------------------
>
> # PIP 342: Support OpenTelemetry metrics in Pulsar client
>
> ## Motivation
>
> Current support for metric instrumentation in Pulsar client is very limited
> and poses a lot of
> issues for integrating the metrics into any telemetry system.
>
> We have 2 ways that metrics are exposed today:
>
> 1. Printing logs every 1 minute: While this is ok as it comes out of the
> box, it's very hard for
>    any application to get the data or use it in any meaningful way.
> 2. `producer.getStats()` or `consumer.getStats()`: Calling these methods
> will get access to
>    the rate of events in the last 1-minute interval. This is problematic
> because out of the
>    box the metrics are not collected anywhere. One would have to start its
> own thread to
>    periodically check these values and export them to some other system.
>
> Neither of these mechanism that we have today are sufficient to enable
> application to easily
> export the telemetry data of Pulsar client SDK.
>
> ## Goal
>
> Provide a good way for applications to retrieve and analyze the usage of
> Pulsar client operation,
> in particular with respect to:
>
> 1. Maximizing compatibility with existing telemetry systems
> 2. Minimizing the effort required to export these metrics
>
> ## Why OpenTelemetry?
>
> [OpenTelemetry](https://opentelemetry.io/) is quickly becoming the
> de-facto
> standard API for metric and
> tracing instrumentation. In fact, as part of [PIP-264](
> https://github.com/apache/pulsar/blob/master/pip/pip-264.md),
> we are already migrating the Pulsar server side metrics to use
> OpenTelemetry.
>
> For Pulsar client SDK, we need to provide a similar way for application
> builder to quickly integrate and
> export Pulsar metrics.
>
> ### Why exposing OpenTelemetry directly in Pulsar API
>
> When deciding how to expose the metrics exporter configuration there are
> multiple options:
>
>  1. Accept an `OpenTelemetry` object directly in Pulsar API
>  2. Build a pluggable interface that describe all the Pulsar client SDK
> events and allow application to
>     provide an implementation, perhaps providing an OpenTelemetry included
> option.
>
> For this proposal, we are following the (1) option. Here are the reasons:
>
>  1. In a way, OpenTelemetry can be compared to [SLF4J](
> https://www.slf4j.org/), in the sense that it provides an API
>     on top of which different vendor can build multiple implementations.
> Therefore, there is no need to create a new
>     Pulsar-specific interface
>  2. OpenTelemetry has 2 main artifacts: API and SDK. For the context of
> Pulsar client, we will only depend on its
>     API. Applications that are going to use OpenTelemetry, will include the
> OTel SDK
>  3. Providing a custom interface has several drawbacks:
>      1. Applications need to update their implementations every time a new
> metric is added in Pulsar SDK
>      2. The surface of this plugin API can become quite big when there are
> several metrics
>      3. If we imagine an application that uses multiple libraries, like
> Pulsar SDK, and each of these has its own
>         custom way to expose metrics, we can see the level of integration
> burden that is pushed to application
>         developers
>  4. It will always be easy to use OpenTelemetry to collect the metrics and
> export them using a custom metrics API. There
>     are several examples of this in OpenTelemetry documentation.
>
> ## Public API changes
>
> ### Enabling OpenTelemetry
>
> When building a `PulsarClient` instance, it will be possible to pass an
> `OpenTelemetry` object:
>
> ```java
> interface ClientBuilder {
>     // ...
>     ClientBuilder openTelemetry(io.opentelemetry.api.OpenTelemetry
> openTelemetry);
>
>     ClientBuilder openTelemetryMetricsCardinality(MetricsCardinality
> metricsCardinality);
> }
> ```
>
> The common usage for an application would be something like:
>
> ```java
> // Creates a OpenTelemetry instance using environment variables to
> configure it
> OpenTelemetry otel=AutoConfiguredOpenTelemetrySdk.builder()
>         .build().getOpenTelemetrySdk();
>
>         PulsarClient client=PulsarClient.builder()
>         .serviceUrl("pulsar://localhost:6650")
>         .build();
>
> // ....
> ```
>
> Cardinality enum will allow to select a default cardinality label to be
> attached to the
> metrics:
>
> ```java
> public enum MetricsCardinality {
>     /**
>      * Do not add additional labels to metrics
>      */
>     None,
>
>     /**
>      * Label metrics by tenant
>      */
>     Tenant,
>
>     /**
>      * Label metrics by tenant and namespace
>      */
>     Namespace,
>
>     /**
>      * Label metrics by topic
>      */
>     Topic,
>
>     /**
>      * Label metrics by each partition
>      */
>     Partition,
> }
> ```
>
> The labels are addictive. For example, selecting `Topic` level would mean
> that the metrics will be
> labeled like:
>
> ```
>
> pulsar_client_received_total{namespace="public/default",tenant="public",topic="persistent://public/default/pt"}
> 149.0
> ```
>
> While selecting `Namespace` level:
>
> ```
> pulsar_client_received_total{namespace="public/default",tenant="public"}
> 149.0
> ```
>
> ### Deprecating the old stats methods
>
> The old way of collecting stats will be disabled by default, deprecated and
> eventually removed
> in Pulsar 4.0 release.
>
> Methods to deprecate:
>
> ```java
> interface ClientBuilder {
>     // ...
>     @Deprecated
>     ClientBuilder statsInterval(long statsInterval, TimeUnit unit);
> }
>
> interface Producer {
>     @Deprecated
>     ProducerStats getStats();
> }
>
> interface Consumer {
>     @Deprecated
>     ConsumerStats getStats();
> }
> ```
>
> ## Initial set of metrics to include
>
> Based on the experience of Pulsar Go client SDK metrics (
> see:
>
> https://github.com/apache/pulsar-client-go/blob/master/pulsar/internal/metrics.go
> ),
> this is the proposed initial set of metrics to export.
>
> Additional metrics could be added later on, though it's better to start
> with the set of most important metrics
> and then evaluate any missing information.
>
> | OTel metric name                                | Type      | Unit
>  | Description
>                        |
>
> |-------------------------------------------------|-----------|-------------|------------------------------------------------------------------------------------------------|
> | `pulsar.client.connections.opened`              | Counter   | connections
> | Counter of connections opened
>                      |
> | `pulsar.client.connections.closed`              | Counter   | connections
> | Counter of connections closed
>                      |
> | `pulsar.client.connections.failed`              | Counter   | connections
> | Counter of connections establishment failures
>                      |
> | `pulsar.client.session.opened`                  | Counter   | sessions
>  | Counter of sessions opened. `type="producer"` or `consumer`
>                        |
> | `pulsar.client.session.closed`                  | Counter   | sessions
>  | Counter of sessions closed. `type="producer"` or `consumer`
>                        |
> | `pulsar.client.received`                        | Counter   | messages
>  | Number of messages received
>                        |
> | `pulsar.client.received`                        | Counter   | bytes
> | Number of bytes received
>                       |
> | `pulsar.client.consumer.preteched.messages`     | Gauge     | messages
>  | Number of messages currently sitting in the consumer pre-fetch queue
>                       |
> | `pulsar.client.consumer.preteched`              | Gauge     | bytes
> | Total number of bytes currently sitting in the consumer pre-fetch queue
>                      |
> | `pulsar.client.consumer.ack`                    | Counter   | messages
>  | Number of ack operations
>                       |
> | `pulsar.client.consumer.nack`                   | Counter   | messages
>  | Number of negative ack operations
>                        |
> | `pulsar.client.consumer.dlq`                    | Counter   | messages
>  | Number of messages sent to DLQ
>                       |
> | `pulsar.client.consumer.ack.timeout`            | Counter   | messages
>  | Number of ack timeouts events
>                        |
> | `pulsar.client.producer.latency`                | Histogram | seconds
> | Publish latency experienced by the application, includes client batching
> time                  |
> | `pulsar.client.producer.rpc.latency`            | Histogram | seconds
> | Publish RPC latency experienced internally by the client when sending
> data to receiving an ack |
> | `pulsar.client.producer.published`              | Counter   | bytes
> | Bytes published
>                      |
> | `pulsar.client.producer.pending.messages.count` | Gauge     | messages
>  | Pending messages for this producer
>                       |
> | `pulsar.client.producer.pending.count`          | Gauge     | bytes
> | Pending bytes for this producer
>                      |
>
> Topic lookup metric will be differentiated by the lookup type label and by
> the lookup transport
> mechanism (`transport-type="binary|http"`):
>
> | OTel metric name                           | Type      | Unit    |
> Description                                      |
>
> |--------------------------------------------|-----------|---------|--------------------------------------------------|
> | `pulsar.client.lookup{type="topic"}`       | Histogram | seconds |
> Counter of topic lookup operations               |
> | `pulsar.client.lookup{type="metadata"}`    | Histogram | seconds |
> Counter of topic partitioned metadata operations |
> | `pulsar.client.lookup{type="schema"}`      | Histogram | seconds |
> Counter of schema retrieval operations           |
> | `pulsar.client.lookup{type="list-topics"}` | Histogram | seconds |
> Counter of namespace list topics operations      |
>
> Additionally, all the histograms will have a `success=true|false` label to
> distinguish successful and failed
> operations.
>
>
>
>
> --
> Matteo Merli
> <matteo.me...@gmail.com>
>

Reply via email to