In my opinion, the performance issue reported is to critical that it should
be considered a regression. The only option is to either revert the initial
change or fix it as soon as possible.

The backward compatibility issue introduced to this fix, on the other hand,
doesn't impact the behavior of the broker other than the reported json from
the stats call. It's a much lesser evil than broker running out of memory
with tens of millions of ProducerStatsImpl as I have seen.

I would prefer deprecating aggregatePublisherStatsByProducerName, but at
the same time I see the reason behind keeping the feature intact in a minor
version change. Nozimo's proposal of
forceAggregatePublisherStatsByProducerName is basically keeping*
aggregatePublisherStatsByProducerName=false
*an option right? What if we change aggregatePublisherStatsByProducerName
to default to true, and logs a warning if it's set to false?

On Thu, Nov 24, 2022 at 10:34 PM Nozomi Kurihara <nkuri...@apache.org>
wrote:

> Hi Heesung,
>
> Thank you for opening this discussion.
>
> IMO backward-compatibility should be kept at least across minor releases.
>
> Although the performance issue you mentioned (e.g. memory burst, high GC
> and OOM) looks problematic, backward-compatibility is also important.
> I think which has higher priority depends on the case.
>
> Your current change seems to remove the index-based aggregation completely.
> However I think we should keep room for choice.
>
> In order to allow users (here Pulsar server-side admins in particular) to
> choose the performance or backward-compatibility, how about introducing a
> "force" setting, e.g. "forceAggregatePublisherStatsByProducerName"?
>
> Those who place more importance on the performance than
> backward-compatibility can set this flag to true.
> Others, those who want to keep backward-compatibility, set this flag to
> false.
>
> By the way, I'm not sure, is the producer name generation logic already
> implemented in C++, Go and other clients?
> If not so, first we should implement it before switching the
> producer-name-based aggregation.
>
> Best Regards,
> Nozomi
>
> 2022年11月17日(木) 8:51 Heesung Sohn <heesung.s...@streamnative.io.invalid>:
>
> > Hi,
> >
> > To add more about the backward incompatibility issue
> > <https://github.com/apache/pulsar/pull/18254#issuecomment-1311168182>,
> >
> > Before fix:
> > % ./bin/pulsar-admin topics partitioned-stats
> > persistent://public/default/pt
> > ...
> >   "publishers" : [ {
> >     "msgRateIn" : 0.0,
> >     "msgThroughputIn" : 0.0,
> >     "averageMsgSize" : 0.0,
> >     "chunkedMessageRate" : 0.0,
> >     "producerId" : 0,
> >     "supportsPartialProducer" : false
> >   } ],
> >
> > After fix:
> > % ./bin/pulsar-admin topics partitioned-stats
> > persistent://public/default/pt
> > ...
> >   "publishers" : [ {
> >     "msgRateIn" : 0.0,
> >     "msgThroughputIn" : 0.0,
> >     "averageMsgSize" : 0.0,
> >     "chunkedMessageRate" : 0.0,
> >     "producerId" : 0,
> >     "supportsPartialProducer" : true,
> >     "producerName" : "standalone-0-1"
> >   }, {
> >     "msgRateIn" : 0.0,
> >     "msgThroughputIn" : 0.0,
> >     "averageMsgSize" : 0.0,
> >     "chunkedMessageRate" : 0.0,
> >     "producerId" : 0,
> >     "supportsPartialProducer" : true,
> >     "producerName" : "standalone-0-0"
> >   } ],
> > ...
> >
> >
> > The broker side's producer name generation has been there since this PR
> > <https://github.com/apache/pulsar/pull/1178>(1.22-incubating).
> > ProducerName was automatically generated in this format
> > {clusterName}-{brokerInstanceId}-{producerNameGenerationCounter} until
> > 2.10.
> > So, a producer to a partitioned topic(2) results in the two producer
> names,
> > like the following.
> > standalone-0-0
> > standalone-0-1
> >
> > And since 2.10, by default, partitioned producers have the same producer
> > name between partitions by this PR
> > <https://github.com/apache/pulsar/pull/10279>(generated on the client
> > side).
> > standalone-0-0
> >
> > Hence, the impacted versions(backward incompatibility issue
> > <https://github.com/apache/pulsar/pull/18254#issuecomment-1311168182>)
> by
> > the proposed fix(Option 1 below) are < 2.10.
> >
> > When we aggregate stats between partitions, the default(
> > aggregatePublisherStatsByProducerName=false) aggregates the producer
> stats
> > by the index of the producer stat list from each partition. So, when
> lucky,
> > it could output a single producer stat. However, this method can be
> buggy,
> > as each partition could return a different size and index of the
> > producer stat list.
> >
> >
> > To fix the original issue(described here
> > <https://github.com/apache/pulsar/pull/18254>), I think we have the
> > following options.
> >
> > Option 1(proposed): Deprecate aggregatePublisherStatsByProducerName (the
> > current PR here <https://github.com/apache/pulsar/pull/18254>) in the
> next
> > release and live with behavior change where we get `topics
> > partitioned-stats` per-producer-and-partition from old clients(Ver.
> <2.10),
> > instead of stats per-producer.
> >
> > Option 2: Defer the Option 1 fix and push it to the next major
> > version(3.0.0), as this is a breaking change.
> >
> > Option 3: Keep aggregatePublisherStatsByProducerName config but change
> the
> > default to aggregatePublisherStatsByProducerName=true.
> >
> > Option 4: As a long-term fix, create separate Admin-APIs for publisher
> and
> > subscription stats and drop their stats from `topics partitioned-stats`
> as
> > it is expensive to aggregate them on the fly. (for thousands of
> publishers
> > and subscriptions). Push this change to the next major version.
> >
> > Or other suggestions?
> >
> > Regards,
> > Heesung
> >
> >
> >
> >
> >
> >
> > On Mon, Nov 14, 2022 at 7:56 PM Heesung Sohn <
> heesung.s...@streamnative.io
> > >
> > wrote:
> >
> > > Dear Pulsar Community,
> > >
> > > We recently found a bug in `pulsar-admin topics partitioned-stats api`
> > that
> > > could incur a memory burst, high GC time, or OOM.
> > >
> > > For this issue, I proposed a fix
> > > <https://github.com/apache/pulsar/pull/18254> by deprecating the
> > aggregatePublisherStatsByProducerName
> > > config and always aggregating the publishers' stats by publisherName,
> > > instead of the list index(aggregatePublisherStatsByProducerName=false,
> > > default).
> > >
> > >
> > >    -  The index-based aggregation is inherently wrong in a highly
> > >    concurrent producer environment(where the order and size of the
> > publisher
> > >    stat list are not guaranteed to be the same). The publisher stats
> > need to
> > >    be aggregated by a unique key, preferably the producer
> > >    name(aggregatePublisherStatsByProducerName=true).
> > >
> > >
> > > However, this fix will break some of the old client's compatibility
> since
> > > the way Pulsar generates the producer name has changed over time, as
> > > described here
> > > <https://github.com/apache/pulsar/pull/18254#issuecomment-1311168182>.
> > >
> > > As I replied here
> > > <https://github.com/apache/pulsar/pull/18254#issuecomment-1312084363>,
> > although
> > > it is not desirable, I think we could be lenient on this change in the
> > stat
> > > API response(assuming thispublishers'stat struct is used for human
> admins
> > > only for ad-hoc checks).
> > >
> > > Are we OK with this non-backward-compatible fix for some of the old
> > > clients? Or, do you have any other suggestions?
> > >
> > > One idea for a long-term fix could be:
> > > When there are thousands of producers(consumers) for a
> > > (partitioned-)topic, it is expensive to aggregate each
> > > publisher(subscriptions)'s stats on-the-fly across the brokers.
> > Alternatively,
> > > for the next major version, I think we could further define
> > > producers(subscriptions)' API like the below and drop the publishers
> and
> > > subscriptions structs from topics (partitioned-)stats returns.
> > >
> > > pulsar-admin publishers list my-topic --last-pagination-key xyz
> > > pulsar-admin publishers stats my-producer
> > >
> > > # similarly for subscriptions
> > >
> > > Regards,
> > > Heesung
> > >
> >
>

Reply via email to