Hi,

I dont think adding another config around this is a good idea. Pulsar
already has so many configs, and to me adding the config will further
complicate the system— this is anti to the “Invent and Simplify” principle.

Based on what we discussed, Pulsar cannot give up the index-based
aggregation due to the backward compatibility. Hence, I think we can update
the PR and only fix the bug, the nested for-loop producer stat obj
generation without changing any default configs. In this way, the memory
burst issue can be minimized at least. Plus, we can update the doc to warn
users.

Please let me know what you think.

Thanks,
Heesung

On Tue, Nov 29, 2022 at 10:02 PM Lin Zhao <lin.z...@streamnative.io.invalid>
wrote:

> Hi Nozomi,
>
> Thank you for the explanation. It's clearer to me now. Maybe instead of
> aggregatePublisherStatsByProducerName, name it allowStatsAggregationByIndex
> to be more descriptive. Defaults to false which makes this fix take effect.
>
> On Mon, Nov 28, 2022 at 9:36 PM Nozomi Kurihara <nkuri...@apache.org>
> wrote:
>
> > Hi Lin,
> >
> > > What if we change aggregatePublisherStatsByProducerName to default to
> > true, and logs a warning if it's set to false?
> >
> > Currently "aggregatePublisherStatsByProducerName=true" means that stats
> is
> > aggregated by NOT ONLY producer name BUT index when clients don't support
> > partial producer.
> >
> >
> https://github.com/apache/pulsar/blob/6a719480b2afc84f5acb85fedf3accf8861eb97a/conf/broker.conf#L1476-L1478
> >
> > Thus
> > > change aggregatePublisherStatsByProducerName to default to true
> > is not enough to avoid the performance issue since the index-based
> > aggregation can still work.
> >
> > To address this issue, https://github.com/apache/pulsar/pull/18254
> > completely removes the index-based aggregation logic.
> > However this change will cause another problem, breaking the backward
> > compatibility.
> > Though I agree that the impact is less than the performance issue, I
> think
> > we should keep the existing behavior too.
> >
> > That's why I proposed "force" setting by which users have 3 options:
> >
> > (1) force=true
> > * stats is aggregated by only producer name
> > * for those who want to avoid the performance issue
> >
> > (2) force=false & aggregatePublisherStatsByProducerName=true
> > * stats is aggregated by producer name for clients which supports partial
> > producer, otherwise by index
> > * for those who want to keep the backward-compatibility
> >
> > (3) force=false & aggregatePublisherStatsByProducerName=false
> > * stats is aggregated by only index
> > * (I'm not sure whether there are users who need this option)
> >
> > Best Regards,
> > Nozomi
> >
> > 2022年11月29日(火) 8:55 Lin Zhao <lin.z...@streamnative.io.invalid>:
> >
> > > In my opinion, the performance issue reported is to critical that it
> > should
> > > be considered a regression. The only option is to either revert the
> > initial
> > > change or fix it as soon as possible.
> > >
> > > The backward compatibility issue introduced to this fix, on the other
> > hand,
> > > doesn't impact the behavior of the broker other than the reported json
> > from
> > > the stats call. It's a much lesser evil than broker running out of
> memory
> > > with tens of millions of ProducerStatsImpl as I have seen.
> > >
> > > I would prefer deprecating aggregatePublisherStatsByProducerName, but
> at
> > > the same time I see the reason behind keeping the feature intact in a
> > minor
> > > version change. Nozimo's proposal of
> > > forceAggregatePublisherStatsByProducerName is basically keeping*
> > > aggregatePublisherStatsByProducerName=false
> > > *an option right? What if we change
> aggregatePublisherStatsByProducerName
> > > to default to true, and logs a warning if it's set to false?
> > >
> > > On Thu, Nov 24, 2022 at 10:34 PM Nozomi Kurihara <nkuri...@apache.org>
> > > wrote:
> > >
> > > > Hi Heesung,
> > > >
> > > > Thank you for opening this discussion.
> > > >
> > > > IMO backward-compatibility should be kept at least across minor
> > releases.
> > > >
> > > > Although the performance issue you mentioned (e.g. memory burst, high
> > GC
> > > > and OOM) looks problematic, backward-compatibility is also important.
> > > > I think which has higher priority depends on the case.
> > > >
> > > > Your current change seems to remove the index-based aggregation
> > > completely.
> > > > However I think we should keep room for choice.
> > > >
> > > > In order to allow users (here Pulsar server-side admins in
> particular)
> > to
> > > > choose the performance or backward-compatibility, how about
> > introducing a
> > > > "force" setting, e.g. "forceAggregatePublisherStatsByProducerName"?
> > > >
> > > > Those who place more importance on the performance than
> > > > backward-compatibility can set this flag to true.
> > > > Others, those who want to keep backward-compatibility, set this flag
> to
> > > > false.
> > > >
> > > > By the way, I'm not sure, is the producer name generation logic
> already
> > > > implemented in C++, Go and other clients?
> > > > If not so, first we should implement it before switching the
> > > > producer-name-based aggregation.
> > > >
> > > > Best Regards,
> > > > Nozomi
> > > >
> > > > 2022年11月17日(木) 8:51 Heesung Sohn <heesung.s...@streamnative.io
> > .invalid>:
> > > >
> > > > > Hi,
> > > > >
> > > > > To add more about the backward incompatibility issue
> > > > > <
> https://github.com/apache/pulsar/pull/18254#issuecomment-1311168182
> > >,
> > > > >
> > > > > Before fix:
> > > > > % ./bin/pulsar-admin topics partitioned-stats
> > > > > persistent://public/default/pt
> > > > > ...
> > > > >   "publishers" : [ {
> > > > >     "msgRateIn" : 0.0,
> > > > >     "msgThroughputIn" : 0.0,
> > > > >     "averageMsgSize" : 0.0,
> > > > >     "chunkedMessageRate" : 0.0,
> > > > >     "producerId" : 0,
> > > > >     "supportsPartialProducer" : false
> > > > >   } ],
> > > > >
> > > > > After fix:
> > > > > % ./bin/pulsar-admin topics partitioned-stats
> > > > > persistent://public/default/pt
> > > > > ...
> > > > >   "publishers" : [ {
> > > > >     "msgRateIn" : 0.0,
> > > > >     "msgThroughputIn" : 0.0,
> > > > >     "averageMsgSize" : 0.0,
> > > > >     "chunkedMessageRate" : 0.0,
> > > > >     "producerId" : 0,
> > > > >     "supportsPartialProducer" : true,
> > > > >     "producerName" : "standalone-0-1"
> > > > >   }, {
> > > > >     "msgRateIn" : 0.0,
> > > > >     "msgThroughputIn" : 0.0,
> > > > >     "averageMsgSize" : 0.0,
> > > > >     "chunkedMessageRate" : 0.0,
> > > > >     "producerId" : 0,
> > > > >     "supportsPartialProducer" : true,
> > > > >     "producerName" : "standalone-0-0"
> > > > >   } ],
> > > > > ...
> > > > >
> > > > >
> > > > > The broker side's producer name generation has been there since
> this
> > PR
> > > > > <https://github.com/apache/pulsar/pull/1178>(1.22-incubating).
> > > > > ProducerName was automatically generated in this format
> > > > > {clusterName}-{brokerInstanceId}-{producerNameGenerationCounter}
> > until
> > > > > 2.10.
> > > > > So, a producer to a partitioned topic(2) results in the two
> producer
> > > > names,
> > > > > like the following.
> > > > > standalone-0-0
> > > > > standalone-0-1
> > > > >
> > > > > And since 2.10, by default, partitioned producers have the same
> > > producer
> > > > > name between partitions by this PR
> > > > > <https://github.com/apache/pulsar/pull/10279>(generated on the
> > client
> > > > > side).
> > > > > standalone-0-0
> > > > >
> > > > > Hence, the impacted versions(backward incompatibility issue
> > > > > <
> https://github.com/apache/pulsar/pull/18254#issuecomment-1311168182
> > >)
> > > > by
> > > > > the proposed fix(Option 1 below) are < 2.10.
> > > > >
> > > > > When we aggregate stats between partitions, the default(
> > > > > aggregatePublisherStatsByProducerName=false) aggregates the
> producer
> > > > stats
> > > > > by the index of the producer stat list from each partition. So,
> when
> > > > lucky,
> > > > > it could output a single producer stat. However, this method can be
> > > > buggy,
> > > > > as each partition could return a different size and index of the
> > > > > producer stat list.
> > > > >
> > > > >
> > > > > To fix the original issue(described here
> > > > > <https://github.com/apache/pulsar/pull/18254>), I think we have
> the
> > > > > following options.
> > > > >
> > > > > Option 1(proposed): Deprecate aggregatePublisherStatsByProducerName
> > > (the
> > > > > current PR here <https://github.com/apache/pulsar/pull/18254>) in
> > the
> > > > next
> > > > > release and live with behavior change where we get `topics
> > > > > partitioned-stats` per-producer-and-partition from old clients(Ver.
> > > > <2.10),
> > > > > instead of stats per-producer.
> > > > >
> > > > > Option 2: Defer the Option 1 fix and push it to the next major
> > > > > version(3.0.0), as this is a breaking change.
> > > > >
> > > > > Option 3: Keep aggregatePublisherStatsByProducerName config but
> > change
> > > > the
> > > > > default to aggregatePublisherStatsByProducerName=true.
> > > > >
> > > > > Option 4: As a long-term fix, create separate Admin-APIs for
> > publisher
> > > > and
> > > > > subscription stats and drop their stats from `topics
> > partitioned-stats`
> > > > as
> > > > > it is expensive to aggregate them on the fly. (for thousands of
> > > > publishers
> > > > > and subscriptions). Push this change to the next major version.
> > > > >
> > > > > Or other suggestions?
> > > > >
> > > > > Regards,
> > > > > Heesung
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > > > On Mon, Nov 14, 2022 at 7:56 PM Heesung Sohn <
> > > > heesung.s...@streamnative.io
> > > > > >
> > > > > wrote:
> > > > >
> > > > > > Dear Pulsar Community,
> > > > > >
> > > > > > We recently found a bug in `pulsar-admin topics partitioned-stats
> > > api`
> > > > > that
> > > > > > could incur a memory burst, high GC time, or OOM.
> > > > > >
> > > > > > For this issue, I proposed a fix
> > > > > > <https://github.com/apache/pulsar/pull/18254> by deprecating the
> > > > > aggregatePublisherStatsByProducerName
> > > > > > config and always aggregating the publishers' stats by
> > publisherName,
> > > > > > instead of the list
> > > index(aggregatePublisherStatsByProducerName=false,
> > > > > > default).
> > > > > >
> > > > > >
> > > > > >    -  The index-based aggregation is inherently wrong in a highly
> > > > > >    concurrent producer environment(where the order and size of
> the
> > > > > publisher
> > > > > >    stat list are not guaranteed to be the same). The publisher
> > stats
> > > > > need to
> > > > > >    be aggregated by a unique key, preferably the producer
> > > > > >    name(aggregatePublisherStatsByProducerName=true).
> > > > > >
> > > > > >
> > > > > > However, this fix will break some of the old client's
> compatibility
> > > > since
> > > > > > the way Pulsar generates the producer name has changed over time,
> > as
> > > > > > described here
> > > > > > <
> > https://github.com/apache/pulsar/pull/18254#issuecomment-1311168182
> > > >.
> > > > > >
> > > > > > As I replied here
> > > > > > <
> > https://github.com/apache/pulsar/pull/18254#issuecomment-1312084363
> > > >,
> > > > > although
> > > > > > it is not desirable, I think we could be lenient on this change
> in
> > > the
> > > > > stat
> > > > > > API response(assuming thispublishers'stat struct is used for
> human
> > > > admins
> > > > > > only for ad-hoc checks).
> > > > > >
> > > > > > Are we OK with this non-backward-compatible fix for some of the
> old
> > > > > > clients? Or, do you have any other suggestions?
> > > > > >
> > > > > > One idea for a long-term fix could be:
> > > > > > When there are thousands of producers(consumers) for a
> > > > > > (partitioned-)topic, it is expensive to aggregate each
> > > > > > publisher(subscriptions)'s stats on-the-fly across the brokers.
> > > > > Alternatively,
> > > > > > for the next major version, I think we could further define
> > > > > > producers(subscriptions)' API like the below and drop the
> > publishers
> > > > and
> > > > > > subscriptions structs from topics (partitioned-)stats returns.
> > > > > >
> > > > > > pulsar-admin publishers list my-topic --last-pagination-key xyz
> > > > > > pulsar-admin publishers stats my-producer
> > > > > >
> > > > > > # similarly for subscriptions
> > > > > >
> > > > > > Regards,
> > > > > > Heesung
> > > > > >
> > > > >
> > > >
> > >
> >
>

Reply via email to