Hi,

To add more about the backward incompatibility issue
<https://github.com/apache/pulsar/pull/18254#issuecomment-1311168182>,

Before fix:
% ./bin/pulsar-admin topics partitioned-stats persistent://public/default/pt
...
  "publishers" : [ {
    "msgRateIn" : 0.0,
    "msgThroughputIn" : 0.0,
    "averageMsgSize" : 0.0,
    "chunkedMessageRate" : 0.0,
    "producerId" : 0,
    "supportsPartialProducer" : false
  } ],

After fix:
% ./bin/pulsar-admin topics partitioned-stats persistent://public/default/pt
...
  "publishers" : [ {
    "msgRateIn" : 0.0,
    "msgThroughputIn" : 0.0,
    "averageMsgSize" : 0.0,
    "chunkedMessageRate" : 0.0,
    "producerId" : 0,
    "supportsPartialProducer" : true,
    "producerName" : "standalone-0-1"
  }, {
    "msgRateIn" : 0.0,
    "msgThroughputIn" : 0.0,
    "averageMsgSize" : 0.0,
    "chunkedMessageRate" : 0.0,
    "producerId" : 0,
    "supportsPartialProducer" : true,
    "producerName" : "standalone-0-0"
  } ],
...


The broker side's producer name generation has been there since this PR
<https://github.com/apache/pulsar/pull/1178>(1.22-incubating).
ProducerName was automatically generated in this format
{clusterName}-{brokerInstanceId}-{producerNameGenerationCounter} until 2.10.
So, a producer to a partitioned topic(2) results in the two producer names,
like the following.
standalone-0-0
standalone-0-1

And since 2.10, by default, partitioned producers have the same producer
name between partitions by this PR
<https://github.com/apache/pulsar/pull/10279>(generated on the client side).
standalone-0-0

Hence, the impacted versions(backward incompatibility issue
<https://github.com/apache/pulsar/pull/18254#issuecomment-1311168182>) by
the proposed fix(Option 1 below) are < 2.10.

When we aggregate stats between partitions, the default(
aggregatePublisherStatsByProducerName=false) aggregates the producer stats
by the index of the producer stat list from each partition. So, when lucky,
it could output a single producer stat. However, this method can be buggy,
as each partition could return a different size and index of the
producer stat list.


To fix the original issue(described here
<https://github.com/apache/pulsar/pull/18254>), I think we have the
following options.

Option 1(proposed): Deprecate aggregatePublisherStatsByProducerName (the
current PR here <https://github.com/apache/pulsar/pull/18254>) in the next
release and live with behavior change where we get `topics
partitioned-stats` per-producer-and-partition from old clients(Ver. <2.10),
instead of stats per-producer.

Option 2: Defer the Option 1 fix and push it to the next major
version(3.0.0), as this is a breaking change.

Option 3: Keep aggregatePublisherStatsByProducerName config but change the
default to aggregatePublisherStatsByProducerName=true.

Option 4: As a long-term fix, create separate Admin-APIs for publisher and
subscription stats and drop their stats from `topics partitioned-stats` as
it is expensive to aggregate them on the fly. (for thousands of publishers
and subscriptions). Push this change to the next major version.

Or other suggestions?

Regards,
Heesung






On Mon, Nov 14, 2022 at 7:56 PM Heesung Sohn <heesung.s...@streamnative.io>
wrote:

> Dear Pulsar Community,
>
> We recently found a bug in `pulsar-admin topics partitioned-stats api` that
> could incur a memory burst, high GC time, or OOM.
>
> For this issue, I proposed a fix
> <https://github.com/apache/pulsar/pull/18254> by deprecating the 
> aggregatePublisherStatsByProducerName
> config and always aggregating the publishers' stats by publisherName,
> instead of the list index(aggregatePublisherStatsByProducerName=false,
> default).
>
>
>    -  The index-based aggregation is inherently wrong in a highly
>    concurrent producer environment(where the order and size of the publisher
>    stat list are not guaranteed to be the same). The publisher stats need to
>    be aggregated by a unique key, preferably the producer
>    name(aggregatePublisherStatsByProducerName=true).
>
>
> However, this fix will break some of the old client's compatibility since
> the way Pulsar generates the producer name has changed over time, as
> described here
> <https://github.com/apache/pulsar/pull/18254#issuecomment-1311168182>.
>
> As I replied here
> <https://github.com/apache/pulsar/pull/18254#issuecomment-1312084363>, 
> although
> it is not desirable, I think we could be lenient on this change in the stat
> API response(assuming thispublishers'stat struct is used for human admins
> only for ad-hoc checks).
>
> Are we OK with this non-backward-compatible fix for some of the old
> clients? Or, do you have any other suggestions?
>
> One idea for a long-term fix could be:
> When there are thousands of producers(consumers) for a
> (partitioned-)topic, it is expensive to aggregate each
> publisher(subscriptions)'s stats on-the-fly across the brokers. Alternatively,
> for the next major version, I think we could further define
> producers(subscriptions)' API like the below and drop the publishers and
> subscriptions structs from topics (partitioned-)stats returns.
>
> pulsar-admin publishers list my-topic --last-pagination-key xyz
> pulsar-admin publishers stats my-producer
>
> # similarly for subscriptions
>
> Regards,
> Heesung
>

Reply via email to