+1

On Sun, Mar 3, 2024 at 16:58, Enrico Olivelli <eolive...@gmail.com> wrote:

> I support the initiative
>
> Lgtm
>
>
> Thanks
> Enrico
>
> On Sun, Mar 3, 2024, 04:09 Matteo Merli <matteo.me...@gmail.com> wrote:
>
> > PIP PR: https://github.com/apache/pulsar/pull/22178
> >
> > WIP of proposed implementation:
> > https://github.com/apache/pulsar/pull/22179
> >
> > --------------------
> >
> > # PIP 342: Support OpenTelemetry metrics in Pulsar client
> >
> > ## Motivation
> >
> > Current support for metric instrumentation in the Pulsar client is very limited
> > and poses many issues for integrating the metrics into any telemetry system.
> >
> > We have 2 ways that metrics are exposed today:
> >
> > 1. Printing logs every 1 minute: While this is OK as it works out of the box, it's very hard for
> >    any application to get the data or use it in any meaningful way.
> > 2. `producer.getStats()` or `consumer.getStats()`: Calling these methods gives access to
> >    the rate of events in the last 1-minute interval. This is problematic because, out of the
> >    box, the metrics are not collected anywhere. An application would have to start its own thread to
> >    periodically check these values and export them to some other system.
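
To make the burden in point 2 concrete, here is a rough sketch of the polling loop an application has to carry today. It is only an illustration under stated assumptions: `StatsPoller` is a hypothetical helper and not part of the Pulsar API, the `DoubleSupplier` stands in for a value read from `producer.getStats()`, and the `DoubleConsumer` stands in for whatever telemetry system the application exports to.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.DoubleConsumer;
import java.util.function.DoubleSupplier;

// Hypothetical helper (not part of Pulsar): polls a stats value and forwards it.
final class StatsPoller implements AutoCloseable {
    // Assumption: stands in for a rate read from producer.getStats() / consumer.getStats()
    private final DoubleSupplier rateSource;
    // Assumption: stands in for the application's own exporter
    private final DoubleConsumer exporter;
    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor(r -> {
                Thread t = new Thread(r, "stats-poller");
                t.setDaemon(true); // do not keep the JVM alive just for polling
                return t;
            });

    StatsPoller(DoubleSupplier rateSource, DoubleConsumer exporter) {
        this.rateSource = rateSource;
        this.exporter = exporter;
    }

    // Reads the current value once and hands it to the exporter.
    void pollOnce() {
        exporter.accept(rateSource.getAsDouble());
    }

    // Starts the dedicated polling thread that the application has to own.
    void start(long period, TimeUnit unit) {
        scheduler.scheduleAtFixedRate(this::pollOnce, period, period, unit);
    }

    @Override
    public void close() {
        scheduler.shutdownNow();
    }
}
```

Every application would need some variant of this for each producer and consumer, which is the integration effort the proposal aims to remove.
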
> >
> > Neither of these mechanisms is sufficient to let applications easily
> > export the telemetry data of the Pulsar client SDK.
> >
> > ## Goal
> >
> > Provide a good way for applications to retrieve and analyze the usage of Pulsar client operations,
> > in particular with respect to:
> >
> > 1. Maximizing compatibility with existing telemetry systems
> > 2. Minimizing the effort required to export these metrics
> >
> > ## Why OpenTelemetry?
> >
> > [OpenTelemetry](https://opentelemetry.io/) is quickly becoming the de-facto standard API for metric and
> > tracing instrumentation. In fact, as part of [PIP-264](https://github.com/apache/pulsar/blob/master/pip/pip-264.md),
> > we are already migrating the Pulsar server side metrics to use OpenTelemetry.
> >
> > For the Pulsar client SDK, we need to provide a similar way for application builders to quickly integrate and
> > export Pulsar metrics.
> >
> > ### Why expose OpenTelemetry directly in the Pulsar API
> >
> > When deciding how to expose the metrics exporter configuration there are
> > multiple options:
> >
> >  1. Accept an `OpenTelemetry` object directly in the Pulsar API
> >  2. Build a pluggable interface that describes all the Pulsar client SDK events and allows applications to
> >     provide an implementation, perhaps with an OpenTelemetry-based option included.
> >
> > For this proposal, we follow option (1). Here are the reasons:
> >
> >  1. In a way, OpenTelemetry can be compared to [SLF4J](https://www.slf4j.org/), in the sense that it provides an API
> >     on top of which different vendors can build multiple implementations. Therefore, there is no need to create a new
> >     Pulsar-specific interface.
> >  2. OpenTelemetry has 2 main artifacts: API and SDK. In the context of the Pulsar client, we will only depend on its
> >     API. Applications that are going to use OpenTelemetry will include the OTel SDK.
> >  3. Providing a custom interface has several drawbacks:
> >      1. Applications need to update their implementations every time a new metric is added in the Pulsar SDK.
> >      2. The surface of this plugin API can become quite big when there are several metrics.
> >      3. If we imagine an application that uses multiple libraries, like the Pulsar SDK, and each of these has its own
> >         custom way to expose metrics, we can see the level of integration burden that is pushed to application
> >         developers.
> >  4. It will always be easy to use OpenTelemetry to collect the metrics and export them using a custom metrics API. There
> >     are several examples of this in the OpenTelemetry documentation.
> >
> > ## Public API changes
> >
> > ### Enabling OpenTelemetry
> >
> > When building a `PulsarClient` instance, it will be possible to pass an
> > `OpenTelemetry` object:
> >
> > ```java
> > interface ClientBuilder {
> >     // ...
> >     ClientBuilder openTelemetry(io.opentelemetry.api.OpenTelemetry openTelemetry);
> >
> >     ClientBuilder openTelemetryMetricsCardinality(MetricsCardinality metricsCardinality);
> > }
> > ```
> >
> > The common usage for an application would be something like:
> >
> > ```java
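> > import io.opentelemetry.api.OpenTelemetry;
> > import io.opentelemetry.sdk.autoconfigure.AutoConfiguredOpenTelemetrySdk;
> > import org.apache.pulsar.client.api.PulsarClient;
> >
> > // Creates an OpenTelemetry instance using environment variables to configure it
> > OpenTelemetry otel = AutoConfiguredOpenTelemetrySdk.builder()
> >         .build().getOpenTelemetrySdk();
> >
> > // Pass the OpenTelemetry instance to the client through the new builder method
> > PulsarClient client = PulsarClient.builder()
> >         .serviceUrl("pulsar://localhost:6650")
> >         .openTelemetry(otel)
> >         .build();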
> >
> > // ....
> > ```
> >
> > The `MetricsCardinality` enum selects the default cardinality of the labels attached to the
> > metrics:
> >
> > ```java
> > public enum MetricsCardinality {
> >     /**
> >      * Do not add additional labels to metrics
> >      */
> >     None,
> >
> >     /**
> >      * Label metrics by tenant
> >      */
> >     Tenant,
> >
> >     /**
> >      * Label metrics by tenant and namespace
> >      */
> >     Namespace,
> >
> >     /**
> >      * Label metrics by topic
> >      */
> >     Topic,
> >
> >     /**
> >      * Label metrics by each partition
> >      */
> >     Partition,
> > }
> > ```
> >
> > The labels are additive. For example, selecting the `Topic` level would mean that the metrics will be
> > labeled like:
> >
> > ```
> > pulsar_client_received_total{namespace="public/default",tenant="public",topic="persistent://public/default/pt"} 149.0
> > ```
> >
> > While selecting `Namespace` level:
> >
> > ```
> > pulsar_client_received_total{namespace="public/default",tenant="public"} 149.0
> > ```
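
For what it's worth, the additive behaviour described above can be sketched as below. This is an illustration, not the code from the WIP PR: `CardinalityLabels` and `labelsFor` are hypothetical names, the nested enum only mirrors the proposed `MetricsCardinality`, the `partition` label name is my assumption (the proposal does not show it), and the sketch relies on the usual `-partition-N` suffix of partitioned topic names.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper showing how each cardinality level adds to the coarser ones.
final class CardinalityLabels {
    // Mirrors the proposed enum; declaration order goes from coarsest to finest.
    enum MetricsCardinality { None, Tenant, Namespace, Topic, Partition }

    private static final String PARTITION_SUFFIX = "-partition-";

    // Expects a full topic name such as persistent://public/default/pt
    static Map<String, String> labelsFor(String topicName, MetricsCardinality level) {
        Map<String, String> labels = new LinkedHashMap<>();
        if (level == MetricsCardinality.None) {
            return labels;
        }
        int scheme = topicName.indexOf("://");
        String[] parts = scheme < 0 ? new String[0] : topicName.substring(scheme + 3).split("/", 3);
        if (parts.length != 3) {
            throw new IllegalArgumentException("Unexpected topic name: " + topicName);
        }
        labels.put("tenant", parts[0]);
        if (level.compareTo(MetricsCardinality.Namespace) >= 0) {
            labels.put("namespace", parts[0] + "/" + parts[1]);
        }
        if (level.compareTo(MetricsCardinality.Topic) >= 0) {
            int idx = topicName.lastIndexOf(PARTITION_SUFFIX);
            String index = idx < 0 ? "" : topicName.substring(idx + PARTITION_SUFFIX.length());
            boolean partitioned = !index.isEmpty() && index.chars().allMatch(Character::isDigit);
            // The topic label carries the parent topic name, without the partition suffix
            labels.put("topic", partitioned ? topicName.substring(0, idx) : topicName);
            if (level == MetricsCardinality.Partition && partitioned) {
                labels.put("partition", index); // assumed label name
            }
        }
        return labels;
    }

    public static void main(String[] args) {
        System.out.println(labelsFor("persistent://public/default/pt", MetricsCardinality.Topic));
        System.out.println(labelsFor("persistent://public/default/pt", MetricsCardinality.Namespace));
    }
}
```

The point of the sketch is that a finer level never drops the labels of a coarser one, which matches the two sample outputs quoted above.
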
> >
> > ### Deprecating the old stats methods
> >
> > The old way of collecting stats will be disabled by default, deprecated, and eventually removed
> > in the Pulsar 4.0 release.
> >
> > Methods to deprecate:
> >
> > ```java
> > interface ClientBuilder {
> >     // ...
> >     @Deprecated
> >     ClientBuilder statsInterval(long statsInterval, TimeUnit unit);
> > }
> >
> > interface Producer {
> >     @Deprecated
> >     ProducerStats getStats();
> > }
> >
> > interface Consumer {
> >     @Deprecated
> >     ConsumerStats getStats();
> > }
> > ```
> >
> > ## Initial set of metrics to include
> >
> > Based on the experience of the Pulsar Go client SDK metrics
> > (see: https://github.com/apache/pulsar-client-go/blob/master/pulsar/internal/metrics.go),
> > this is the proposed initial set of metrics to export.
> >
> > Additional metrics could be added later on, though it's better to start
> > with the set of most important metrics
> > and then evaluate any missing information.
> >
> > | OTel metric name | Type | Unit | Description |
> > |------------------|------|------|-------------|
> > | `pulsar.client.connections.opened` | Counter | connections | Counter of connections opened |
> > | `pulsar.client.connections.closed` | Counter | connections | Counter of connections closed |
> > | `pulsar.client.connections.failed` | Counter | connections | Counter of connection establishment failures |
> > | `pulsar.client.session.opened` | Counter | sessions | Counter of sessions opened. `type="producer"` or `consumer` |
> > | `pulsar.client.session.closed` | Counter | sessions | Counter of sessions closed. `type="producer"` or `consumer` |
> > | `pulsar.client.received` | Counter | messages | Number of messages received |
> > | `pulsar.client.received` | Counter | bytes | Number of bytes received |
> > | `pulsar.client.consumer.preteched.messages` | Gauge | messages | Number of messages currently sitting in the consumer pre-fetch queue |
> > | `pulsar.client.consumer.preteched` | Gauge | bytes | Total number of bytes currently sitting in the consumer pre-fetch queue |
> > | `pulsar.client.consumer.ack` | Counter | messages | Number of ack operations |
> > | `pulsar.client.consumer.nack` | Counter | messages | Number of negative ack operations |
> > | `pulsar.client.consumer.dlq` | Counter | messages | Number of messages sent to DLQ |
> > | `pulsar.client.consumer.ack.timeout` | Counter | messages | Number of ack timeout events |
> > | `pulsar.client.producer.latency` | Histogram | seconds | Publish latency experienced by the application, includes client batching time |
> > | `pulsar.client.producer.rpc.latency` | Histogram | seconds | Publish RPC latency experienced internally by the client, from sending data to receiving an ack |
> > | `pulsar.client.producer.published` | Counter | bytes | Bytes published |
> > | `pulsar.client.producer.pending.messages.count` | Gauge | messages | Pending messages for this producer |
> > | `pulsar.client.producer.pending.count` | Gauge | bytes | Pending bytes for this producer |
> >
> > The topic lookup metric will be differentiated by the lookup type label and by the lookup transport
> > mechanism (`transport-type="binary|http"`):
> >
> > | OTel metric name | Type | Unit | Description |
> > |------------------|------|------|-------------|
> > | `pulsar.client.lookup{type="topic"}` | Histogram | seconds | Counter of topic lookup operations |
> > | `pulsar.client.lookup{type="metadata"}` | Histogram | seconds | Counter of partitioned topic metadata operations |
> > | `pulsar.client.lookup{type="schema"}` | Histogram | seconds | Counter of schema retrieval operations |
> > | `pulsar.client.lookup{type="list-topics"}` | Histogram | seconds | Counter of namespace list topics operations |
> >
> > Additionally, all the histograms will have a `success=true|false` label to distinguish successful and failed
> > operations.
> >
> >
> >
> >
> > --
> > Matteo Merli
> > <matteo.me...@gmail.com>
> >
>
