Hi Heesung,

> update the PR and only fix the bug, the nested for-loop producer stat obj
generation without changing any default configs.
+1
Moreover, I think you can change the default value of
aggregatePublisherStatsByProducerName to true since your final plan still
keeps the index-based aggregation.

>  adding the config will further complicate the system— this is anti to
the “Invent and Simplify” principle.
I agree with you, the simpler configuration, the better.

Best Regards,
Nozomi

2022年11月30日(水) 17:52 Heesung Sohn <heesung.s...@streamnative.io.invalid>:

> Hi,
>
> I dont think adding another config around this is a good idea. Pulsar
> already has so many configs, and to me adding the config will further
> complicate the system— this is anti to the “Invent and Simplify” principle.
>
> Based on what we discussed, Pulsar cannot give up the index-based
> aggregation due to the backward compatibility. Hence, I think we can update
> the PR and only fix the bug, the nested for-loop producer stat obj
> generation without changing any default configs. In this way, the memory
> burst issue can be minimized at least. Plus, we can update the doc to warn
> users.
>
> Please let me know what you think.
>
> Thanks,
> Heesung
>
> On Tue, Nov 29, 2022 at 10:02 PM Lin Zhao <lin.z...@streamnative.io
> .invalid>
> wrote:
>
> > Hi Nozomi,
> >
> > Thank you for the explanation. It's clearer to me now. Maybe instead of
> > aggregatePublisherStatsByProducerName, name it
> allowStatsAggregationByIndex
> > to be more descriptive. Defaults to false which makes this fix take
> effect.
> >
> > On Mon, Nov 28, 2022 at 9:36 PM Nozomi Kurihara <nkuri...@apache.org>
> > wrote:
> >
> > > Hi Lin,
> > >
> > > > What if we change aggregatePublisherStatsByProducerName to default to
> > > true, and logs a warning if it's set to false?
> > >
> > > Currently "aggregatePublisherStatsByProducerName=true" means that stats
> > is
> > > aggregated by NOT ONLY producer name BUT index when clients don't
> support
> > > partial producer.
> > >
> > >
> >
> https://github.com/apache/pulsar/blob/6a719480b2afc84f5acb85fedf3accf8861eb97a/conf/broker.conf#L1476-L1478
> > >
> > > Thus
> > > > change aggregatePublisherStatsByProducerName to default to true
> > > is not enough to avoid the performance issue since the index-based
> > > aggregation can still work.
> > >
> > > To address this issue, https://github.com/apache/pulsar/pull/18254
> > > completely removes the index-based aggregation logic.
> > > However this change will cause another problem, breaking the backward
> > > compatibility.
> > > Though I agree that the impact is less than the performance issue, I
> > think
> > > we should keep the existing behavior too.
> > >
> > > That's why I proposed "force" setting by which users have 3 options:
> > >
> > > (1) force=true
> > > * stats is aggregated by only producer name
> > > * for those who want to avoid the performance issue
> > >
> > > (2) force=false & aggregatePublisherStatsByProducerName=true
> > > * stats is aggregated by producer name for clients which supports
> partial
> > > producer, otherwise by index
> > > * for those who want to keep the backward-compatibility
> > >
> > > (3) force=false & aggregatePublisherStatsByProducerName=false
> > > * stats is aggregated by only index
> > > * (I'm not sure whether there are users who need this option)
> > >
> > > Best Regards,
> > > Nozomi
> > >
> > > 2022年11月29日(火) 8:55 Lin Zhao <lin.z...@streamnative.io.invalid>:
> > >
> > > > In my opinion, the performance issue reported is to critical that it
> > > should
> > > > be considered a regression. The only option is to either revert the
> > > initial
> > > > change or fix it as soon as possible.
> > > >
> > > > The backward compatibility issue introduced to this fix, on the other
> > > hand,
> > > > doesn't impact the behavior of the broker other than the reported
> json
> > > from
> > > > the stats call. It's a much lesser evil than broker running out of
> > memory
> > > > with tens of millions of ProducerStatsImpl as I have seen.
> > > >
> > > > I would prefer deprecating aggregatePublisherStatsByProducerName, but
> > at
> > > > the same time I see the reason behind keeping the feature intact in a
> > > minor
> > > > version change. Nozimo's proposal of
> > > > forceAggregatePublisherStatsByProducerName is basically keeping*
> > > > aggregatePublisherStatsByProducerName=false
> > > > *an option right? What if we change
> > aggregatePublisherStatsByProducerName
> > > > to default to true, and logs a warning if it's set to false?
> > > >
> > > > On Thu, Nov 24, 2022 at 10:34 PM Nozomi Kurihara <
> nkuri...@apache.org>
> > > > wrote:
> > > >
> > > > > Hi Heesung,
> > > > >
> > > > > Thank you for opening this discussion.
> > > > >
> > > > > IMO backward-compatibility should be kept at least across minor
> > > releases.
> > > > >
> > > > > Although the performance issue you mentioned (e.g. memory burst,
> high
> > > GC
> > > > > and OOM) looks problematic, backward-compatibility is also
> important.
> > > > > I think which has higher priority depends on the case.
> > > > >
> > > > > Your current change seems to remove the index-based aggregation
> > > > completely.
> > > > > However I think we should keep room for choice.
> > > > >
> > > > > In order to allow users (here Pulsar server-side admins in
> > particular)
> > > to
> > > > > choose the performance or backward-compatibility, how about
> > > introducing a
> > > > > "force" setting, e.g. "forceAggregatePublisherStatsByProducerName"?
> > > > >
> > > > > Those who place more importance on the performance than
> > > > > backward-compatibility can set this flag to true.
> > > > > Others, those who want to keep backward-compatibility, set this
> flag
> > to
> > > > > false.
> > > > >
> > > > > By the way, I'm not sure, is the producer name generation logic
> > already
> > > > > implemented in C++, Go and other clients?
> > > > > If not so, first we should implement it before switching the
> > > > > producer-name-based aggregation.
> > > > >
> > > > > Best Regards,
> > > > > Nozomi
> > > > >
> > > > > 2022年11月17日(木) 8:51 Heesung Sohn <heesung.s...@streamnative.io
> > > .invalid>:
> > > > >
> > > > > > Hi,
> > > > > >
> > > > > > To add more about the backward incompatibility issue
> > > > > > <
> > https://github.com/apache/pulsar/pull/18254#issuecomment-1311168182
> > > >,
> > > > > >
> > > > > > Before fix:
> > > > > > % ./bin/pulsar-admin topics partitioned-stats
> > > > > > persistent://public/default/pt
> > > > > > ...
> > > > > >   "publishers" : [ {
> > > > > >     "msgRateIn" : 0.0,
> > > > > >     "msgThroughputIn" : 0.0,
> > > > > >     "averageMsgSize" : 0.0,
> > > > > >     "chunkedMessageRate" : 0.0,
> > > > > >     "producerId" : 0,
> > > > > >     "supportsPartialProducer" : false
> > > > > >   } ],
> > > > > >
> > > > > > After fix:
> > > > > > % ./bin/pulsar-admin topics partitioned-stats
> > > > > > persistent://public/default/pt
> > > > > > ...
> > > > > >   "publishers" : [ {
> > > > > >     "msgRateIn" : 0.0,
> > > > > >     "msgThroughputIn" : 0.0,
> > > > > >     "averageMsgSize" : 0.0,
> > > > > >     "chunkedMessageRate" : 0.0,
> > > > > >     "producerId" : 0,
> > > > > >     "supportsPartialProducer" : true,
> > > > > >     "producerName" : "standalone-0-1"
> > > > > >   }, {
> > > > > >     "msgRateIn" : 0.0,
> > > > > >     "msgThroughputIn" : 0.0,
> > > > > >     "averageMsgSize" : 0.0,
> > > > > >     "chunkedMessageRate" : 0.0,
> > > > > >     "producerId" : 0,
> > > > > >     "supportsPartialProducer" : true,
> > > > > >     "producerName" : "standalone-0-0"
> > > > > >   } ],
> > > > > > ...
> > > > > >
> > > > > >
> > > > > > The broker side's producer name generation has been there since
> > this
> > > PR
> > > > > > <https://github.com/apache/pulsar/pull/1178>(1.22-incubating).
> > > > > > ProducerName was automatically generated in this format
> > > > > > {clusterName}-{brokerInstanceId}-{producerNameGenerationCounter}
> > > until
> > > > > > 2.10.
> > > > > > So, a producer to a partitioned topic(2) results in the two
> > producer
> > > > > names,
> > > > > > like the following.
> > > > > > standalone-0-0
> > > > > > standalone-0-1
> > > > > >
> > > > > > And since 2.10, by default, partitioned producers have the same
> > > > producer
> > > > > > name between partitions by this PR
> > > > > > <https://github.com/apache/pulsar/pull/10279>(generated on the
> > > client
> > > > > > side).
> > > > > > standalone-0-0
> > > > > >
> > > > > > Hence, the impacted versions(backward incompatibility issue
> > > > > > <
> > https://github.com/apache/pulsar/pull/18254#issuecomment-1311168182
> > > >)
> > > > > by
> > > > > > the proposed fix(Option 1 below) are < 2.10.
> > > > > >
> > > > > > When we aggregate stats between partitions, the default(
> > > > > > aggregatePublisherStatsByProducerName=false) aggregates the
> > producer
> > > > > stats
> > > > > > by the index of the producer stat list from each partition. So,
> > when
> > > > > lucky,
> > > > > > it could output a single producer stat. However, this method can
> be
> > > > > buggy,
> > > > > > as each partition could return a different size and index of the
> > > > > > producer stat list.
> > > > > >
> > > > > >
> > > > > > To fix the original issue(described here
> > > > > > <https://github.com/apache/pulsar/pull/18254>), I think we have
> > the
> > > > > > following options.
> > > > > >
> > > > > > Option 1(proposed): Deprecate
> aggregatePublisherStatsByProducerName
> > > > (the
> > > > > > current PR here <https://github.com/apache/pulsar/pull/18254>)
> in
> > > the
> > > > > next
> > > > > > release and live with behavior change where we get `topics
> > > > > > partitioned-stats` per-producer-and-partition from old
> clients(Ver.
> > > > > <2.10),
> > > > > > instead of stats per-producer.
> > > > > >
> > > > > > Option 2: Defer the Option 1 fix and push it to the next major
> > > > > > version(3.0.0), as this is a breaking change.
> > > > > >
> > > > > > Option 3: Keep aggregatePublisherStatsByProducerName config but
> > > change
> > > > > the
> > > > > > default to aggregatePublisherStatsByProducerName=true.
> > > > > >
> > > > > > Option 4: As a long-term fix, create separate Admin-APIs for
> > > publisher
> > > > > and
> > > > > > subscription stats and drop their stats from `topics
> > > partitioned-stats`
> > > > > as
> > > > > > it is expensive to aggregate them on the fly. (for thousands of
> > > > > publishers
> > > > > > and subscriptions). Push this change to the next major version.
> > > > > >
> > > > > > Or other suggestions?
> > > > > >
> > > > > > Regards,
> > > > > > Heesung
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > > On Mon, Nov 14, 2022 at 7:56 PM Heesung Sohn <
> > > > > heesung.s...@streamnative.io
> > > > > > >
> > > > > > wrote:
> > > > > >
> > > > > > > Dear Pulsar Community,
> > > > > > >
> > > > > > > We recently found a bug in `pulsar-admin topics
> partitioned-stats
> > > > api`
> > > > > > that
> > > > > > > could incur a memory burst, high GC time, or OOM.
> > > > > > >
> > > > > > > For this issue, I proposed a fix
> > > > > > > <https://github.com/apache/pulsar/pull/18254> by deprecating
> the
> > > > > > aggregatePublisherStatsByProducerName
> > > > > > > config and always aggregating the publishers' stats by
> > > publisherName,
> > > > > > > instead of the list
> > > > index(aggregatePublisherStatsByProducerName=false,
> > > > > > > default).
> > > > > > >
> > > > > > >
> > > > > > >    -  The index-based aggregation is inherently wrong in a
> highly
> > > > > > >    concurrent producer environment(where the order and size of
> > the
> > > > > > publisher
> > > > > > >    stat list are not guaranteed to be the same). The publisher
> > > stats
> > > > > > need to
> > > > > > >    be aggregated by a unique key, preferably the producer
> > > > > > >    name(aggregatePublisherStatsByProducerName=true).
> > > > > > >
> > > > > > >
> > > > > > > However, this fix will break some of the old client's
> > compatibility
> > > > > since
> > > > > > > the way Pulsar generates the producer name has changed over
> time,
> > > as
> > > > > > > described here
> > > > > > > <
> > > https://github.com/apache/pulsar/pull/18254#issuecomment-1311168182
> > > > >.
> > > > > > >
> > > > > > > As I replied here
> > > > > > > <
> > > https://github.com/apache/pulsar/pull/18254#issuecomment-1312084363
> > > > >,
> > > > > > although
> > > > > > > it is not desirable, I think we could be lenient on this change
> > in
> > > > the
> > > > > > stat
> > > > > > > API response(assuming thispublishers'stat struct is used for
> > human
> > > > > admins
> > > > > > > only for ad-hoc checks).
> > > > > > >
> > > > > > > Are we OK with this non-backward-compatible fix for some of the
> > old
> > > > > > > clients? Or, do you have any other suggestions?
> > > > > > >
> > > > > > > One idea for a long-term fix could be:
> > > > > > > When there are thousands of producers(consumers) for a
> > > > > > > (partitioned-)topic, it is expensive to aggregate each
> > > > > > > publisher(subscriptions)'s stats on-the-fly across the brokers.
> > > > > > Alternatively,
> > > > > > > for the next major version, I think we could further define
> > > > > > > producers(subscriptions)' API like the below and drop the
> > > publishers
> > > > > and
> > > > > > > subscriptions structs from topics (partitioned-)stats returns.
> > > > > > >
> > > > > > > pulsar-admin publishers list my-topic --last-pagination-key xyz
> > > > > > > pulsar-admin publishers stats my-producer
> > > > > > >
> > > > > > > # similarly for subscriptions
> > > > > > >
> > > > > > > Regards,
> > > > > > > Heesung
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>

Reply via email to