Hi Heesung, > update the PR and only fix the bug, the nested for-loop producer stat obj generation without changing any default configs. +1 Moreover, I think you can change the default value of aggregatePublisherStatsByProducerName to true since your final plan still keeps the index-based aggregation.
> adding the config will further complicate the system— this is anti to the “Invent and Simplify” principle. I agree with you, the simpler configuration, the better. Best Regards, Nozomi 2022年11月30日(水) 17:52 Heesung Sohn <heesung.s...@streamnative.io.invalid>: > Hi, > > I dont think adding another config around this is a good idea. Pulsar > already has so many configs, and to me adding the config will further > complicate the system— this is anti to the “Invent and Simplify” principle. > > Based on what we discussed, Pulsar cannot give up the index-based > aggregation due to the backward compatibility. Hence, I think we can update > the PR and only fix the bug, the nested for-loop producer stat obj > generation without changing any default configs. In this way, the memory > burst issue can be minimized at least. Plus, we can update the doc to warn > users. > > Please let me know what you think. > > Thanks, > Heesung > > On Tue, Nov 29, 2022 at 10:02 PM Lin Zhao <lin.z...@streamnative.io > .invalid> > wrote: > > > Hi Nozomi, > > > > Thank you for the explanation. It's clearer to me now. Maybe instead of > > aggregatePublisherStatsByProducerName, name it > allowStatsAggregationByIndex > > to be more descriptive. Defaults to false which makes this fix take > effect. > > > > On Mon, Nov 28, 2022 at 9:36 PM Nozomi Kurihara <nkuri...@apache.org> > > wrote: > > > > > Hi Lin, > > > > > > > What if we change aggregatePublisherStatsByProducerName to default to > > > true, and logs a warning if it's set to false? > > > > > > Currently "aggregatePublisherStatsByProducerName=true" means that stats > > is > > > aggregated by NOT ONLY producer name BUT index when clients don't > support > > > partial producer. > > > > > > > > > https://github.com/apache/pulsar/blob/6a719480b2afc84f5acb85fedf3accf8861eb97a/conf/broker.conf#L1476-L1478 > > > > > > Thus > > > > change aggregatePublisherStatsByProducerName to default to true > > > is not enough to avoid the performance issue since the index-based > > > aggregation can still work. > > > > > > To address this issue, https://github.com/apache/pulsar/pull/18254 > > > completely removes the index-based aggregation logic. > > > However this change will cause another problem, breaking the backward > > > compatibility. > > > Though I agree that the impact is less than the performance issue, I > > think > > > we should keep the existing behavior too. > > > > > > That's why I proposed "force" setting by which users have 3 options: > > > > > > (1) force=true > > > * stats is aggregated by only producer name > > > * for those who want to avoid the performance issue > > > > > > (2) force=false & aggregatePublisherStatsByProducerName=true > > > * stats is aggregated by producer name for clients which supports > partial > > > producer, otherwise by index > > > * for those who want to keep the backward-compatibility > > > > > > (3) force=false & aggregatePublisherStatsByProducerName=false > > > * stats is aggregated by only index > > > * (I'm not sure whether there are users who need this option) > > > > > > Best Regards, > > > Nozomi > > > > > > 2022年11月29日(火) 8:55 Lin Zhao <lin.z...@streamnative.io.invalid>: > > > > > > > In my opinion, the performance issue reported is to critical that it > > > should > > > > be considered a regression. The only option is to either revert the > > > initial > > > > change or fix it as soon as possible. > > > > > > > > The backward compatibility issue introduced to this fix, on the other > > > hand, > > > > doesn't impact the behavior of the broker other than the reported > json > > > from > > > > the stats call. It's a much lesser evil than broker running out of > > memory > > > > with tens of millions of ProducerStatsImpl as I have seen. > > > > > > > > I would prefer deprecating aggregatePublisherStatsByProducerName, but > > at > > > > the same time I see the reason behind keeping the feature intact in a > > > minor > > > > version change. Nozimo's proposal of > > > > forceAggregatePublisherStatsByProducerName is basically keeping* > > > > aggregatePublisherStatsByProducerName=false > > > > *an option right? What if we change > > aggregatePublisherStatsByProducerName > > > > to default to true, and logs a warning if it's set to false? > > > > > > > > On Thu, Nov 24, 2022 at 10:34 PM Nozomi Kurihara < > nkuri...@apache.org> > > > > wrote: > > > > > > > > > Hi Heesung, > > > > > > > > > > Thank you for opening this discussion. > > > > > > > > > > IMO backward-compatibility should be kept at least across minor > > > releases. > > > > > > > > > > Although the performance issue you mentioned (e.g. memory burst, > high > > > GC > > > > > and OOM) looks problematic, backward-compatibility is also > important. > > > > > I think which has higher priority depends on the case. > > > > > > > > > > Your current change seems to remove the index-based aggregation > > > > completely. > > > > > However I think we should keep room for choice. > > > > > > > > > > In order to allow users (here Pulsar server-side admins in > > particular) > > > to > > > > > choose the performance or backward-compatibility, how about > > > introducing a > > > > > "force" setting, e.g. "forceAggregatePublisherStatsByProducerName"? > > > > > > > > > > Those who place more importance on the performance than > > > > > backward-compatibility can set this flag to true. > > > > > Others, those who want to keep backward-compatibility, set this > flag > > to > > > > > false. > > > > > > > > > > By the way, I'm not sure, is the producer name generation logic > > already > > > > > implemented in C++, Go and other clients? > > > > > If not so, first we should implement it before switching the > > > > > producer-name-based aggregation. > > > > > > > > > > Best Regards, > > > > > Nozomi > > > > > > > > > > 2022年11月17日(木) 8:51 Heesung Sohn <heesung.s...@streamnative.io > > > .invalid>: > > > > > > > > > > > Hi, > > > > > > > > > > > > To add more about the backward incompatibility issue > > > > > > < > > https://github.com/apache/pulsar/pull/18254#issuecomment-1311168182 > > > >, > > > > > > > > > > > > Before fix: > > > > > > % ./bin/pulsar-admin topics partitioned-stats > > > > > > persistent://public/default/pt > > > > > > ... > > > > > > "publishers" : [ { > > > > > > "msgRateIn" : 0.0, > > > > > > "msgThroughputIn" : 0.0, > > > > > > "averageMsgSize" : 0.0, > > > > > > "chunkedMessageRate" : 0.0, > > > > > > "producerId" : 0, > > > > > > "supportsPartialProducer" : false > > > > > > } ], > > > > > > > > > > > > After fix: > > > > > > % ./bin/pulsar-admin topics partitioned-stats > > > > > > persistent://public/default/pt > > > > > > ... > > > > > > "publishers" : [ { > > > > > > "msgRateIn" : 0.0, > > > > > > "msgThroughputIn" : 0.0, > > > > > > "averageMsgSize" : 0.0, > > > > > > "chunkedMessageRate" : 0.0, > > > > > > "producerId" : 0, > > > > > > "supportsPartialProducer" : true, > > > > > > "producerName" : "standalone-0-1" > > > > > > }, { > > > > > > "msgRateIn" : 0.0, > > > > > > "msgThroughputIn" : 0.0, > > > > > > "averageMsgSize" : 0.0, > > > > > > "chunkedMessageRate" : 0.0, > > > > > > "producerId" : 0, > > > > > > "supportsPartialProducer" : true, > > > > > > "producerName" : "standalone-0-0" > > > > > > } ], > > > > > > ... > > > > > > > > > > > > > > > > > > The broker side's producer name generation has been there since > > this > > > PR > > > > > > <https://github.com/apache/pulsar/pull/1178>(1.22-incubating). > > > > > > ProducerName was automatically generated in this format > > > > > > {clusterName}-{brokerInstanceId}-{producerNameGenerationCounter} > > > until > > > > > > 2.10. > > > > > > So, a producer to a partitioned topic(2) results in the two > > producer > > > > > names, > > > > > > like the following. > > > > > > standalone-0-0 > > > > > > standalone-0-1 > > > > > > > > > > > > And since 2.10, by default, partitioned producers have the same > > > > producer > > > > > > name between partitions by this PR > > > > > > <https://github.com/apache/pulsar/pull/10279>(generated on the > > > client > > > > > > side). > > > > > > standalone-0-0 > > > > > > > > > > > > Hence, the impacted versions(backward incompatibility issue > > > > > > < > > https://github.com/apache/pulsar/pull/18254#issuecomment-1311168182 > > > >) > > > > > by > > > > > > the proposed fix(Option 1 below) are < 2.10. > > > > > > > > > > > > When we aggregate stats between partitions, the default( > > > > > > aggregatePublisherStatsByProducerName=false) aggregates the > > producer > > > > > stats > > > > > > by the index of the producer stat list from each partition. So, > > when > > > > > lucky, > > > > > > it could output a single producer stat. However, this method can > be > > > > > buggy, > > > > > > as each partition could return a different size and index of the > > > > > > producer stat list. > > > > > > > > > > > > > > > > > > To fix the original issue(described here > > > > > > <https://github.com/apache/pulsar/pull/18254>), I think we have > > the > > > > > > following options. > > > > > > > > > > > > Option 1(proposed): Deprecate > aggregatePublisherStatsByProducerName > > > > (the > > > > > > current PR here <https://github.com/apache/pulsar/pull/18254>) > in > > > the > > > > > next > > > > > > release and live with behavior change where we get `topics > > > > > > partitioned-stats` per-producer-and-partition from old > clients(Ver. > > > > > <2.10), > > > > > > instead of stats per-producer. > > > > > > > > > > > > Option 2: Defer the Option 1 fix and push it to the next major > > > > > > version(3.0.0), as this is a breaking change. > > > > > > > > > > > > Option 3: Keep aggregatePublisherStatsByProducerName config but > > > change > > > > > the > > > > > > default to aggregatePublisherStatsByProducerName=true. > > > > > > > > > > > > Option 4: As a long-term fix, create separate Admin-APIs for > > > publisher > > > > > and > > > > > > subscription stats and drop their stats from `topics > > > partitioned-stats` > > > > > as > > > > > > it is expensive to aggregate them on the fly. (for thousands of > > > > > publishers > > > > > > and subscriptions). Push this change to the next major version. > > > > > > > > > > > > Or other suggestions? > > > > > > > > > > > > Regards, > > > > > > Heesung > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > On Mon, Nov 14, 2022 at 7:56 PM Heesung Sohn < > > > > > heesung.s...@streamnative.io > > > > > > > > > > > > > wrote: > > > > > > > > > > > > > Dear Pulsar Community, > > > > > > > > > > > > > > We recently found a bug in `pulsar-admin topics > partitioned-stats > > > > api` > > > > > > that > > > > > > > could incur a memory burst, high GC time, or OOM. > > > > > > > > > > > > > > For this issue, I proposed a fix > > > > > > > <https://github.com/apache/pulsar/pull/18254> by deprecating > the > > > > > > aggregatePublisherStatsByProducerName > > > > > > > config and always aggregating the publishers' stats by > > > publisherName, > > > > > > > instead of the list > > > > index(aggregatePublisherStatsByProducerName=false, > > > > > > > default). > > > > > > > > > > > > > > > > > > > > > - The index-based aggregation is inherently wrong in a > highly > > > > > > > concurrent producer environment(where the order and size of > > the > > > > > > publisher > > > > > > > stat list are not guaranteed to be the same). The publisher > > > stats > > > > > > need to > > > > > > > be aggregated by a unique key, preferably the producer > > > > > > > name(aggregatePublisherStatsByProducerName=true). > > > > > > > > > > > > > > > > > > > > > However, this fix will break some of the old client's > > compatibility > > > > > since > > > > > > > the way Pulsar generates the producer name has changed over > time, > > > as > > > > > > > described here > > > > > > > < > > > https://github.com/apache/pulsar/pull/18254#issuecomment-1311168182 > > > > >. > > > > > > > > > > > > > > As I replied here > > > > > > > < > > > https://github.com/apache/pulsar/pull/18254#issuecomment-1312084363 > > > > >, > > > > > > although > > > > > > > it is not desirable, I think we could be lenient on this change > > in > > > > the > > > > > > stat > > > > > > > API response(assuming thispublishers'stat struct is used for > > human > > > > > admins > > > > > > > only for ad-hoc checks). > > > > > > > > > > > > > > Are we OK with this non-backward-compatible fix for some of the > > old > > > > > > > clients? Or, do you have any other suggestions? > > > > > > > > > > > > > > One idea for a long-term fix could be: > > > > > > > When there are thousands of producers(consumers) for a > > > > > > > (partitioned-)topic, it is expensive to aggregate each > > > > > > > publisher(subscriptions)'s stats on-the-fly across the brokers. > > > > > > Alternatively, > > > > > > > for the next major version, I think we could further define > > > > > > > producers(subscriptions)' API like the below and drop the > > > publishers > > > > > and > > > > > > > subscriptions structs from topics (partitioned-)stats returns. > > > > > > > > > > > > > > pulsar-admin publishers list my-topic --last-pagination-key xyz > > > > > > > pulsar-admin publishers stats my-producer > > > > > > > > > > > > > > # similarly for subscriptions > > > > > > > > > > > > > > Regards, > > > > > > > Heesung > > > > > > > > > > > > > > > > > > > > > > > > > > > >