Hello Joe and Enrico,

I agree with the point you have been emphasizing: we need to fix these issues at the root cause. While maintaining the Go SDK, we have run into many stuck-consumer problems since version 0.4.0. Some were logic errors in the Go SDK itself, and some were caused by users misusing the Go SDK. By the previous release, 0.8.0, the Go SDK was in large-scale use in our environment. Across these releases we have been trying to fix these bugs completely. This is what our maintainers have been working hard on, and it is also the end state we expect for Pulsar: everything works as it should.

However, while the Go SDK went from 0.4.0 to 0.8.0, users in our production environment ran into similar problems many times. So, again, take a production user whose consumption is currently blocked. The user comes to us: do we use some means to quickly let the consumers continue consuming messages? Or do we leave that production user stuck until we find the root cause, fix it, and push the user to upgrade? I think everyone's answer tends toward the former. We will not expose the hacky unload-topic and unload-subscription operations directly to users, only to Pulsar's operations and maintenance personnel, so this is more an operations tool than an interface that users call. That is why I think the impact on Pulsar as a whole is controllable, and why I support it.

Again, this PIP is about buying us more time to locate the problem while minimizing the impact on production users. Having this interface does not mean we stop looking for the real causes of the stuck consumption. On the contrary, it is a trade-off between keeping users running and diagnosing the issue, and it buys us more time for the diagnosis.

--
Thanks
Xiaolong Ran

PengHui Li <peng...@apache.org> wrote on Wed, Jan 18, 2023 at 11:48:

> > What kind of problems is this trying to fix?
> > And why cannot that be solved by client-side fixes?
>
> Yes, most of the issues come from the client side, rarely from the broker. But the application also needs time to release the fix and deploy it to the production environment. Unloading the subscription is just a temporary way to mitigate the issue and reduce the impact. It will not fix the issue completely.
>
> What I learned is to capture the heap dump, topic stats, internal stats, and logs from the broker and client and then try to unload the topic to see if the problem is mitigated.
> If not, then try to restart the broker or client; most of the time, the problem can be mitigated in this way.
> Then we can continue to reproduce the issue and investigate it from the captured heap dump and logs.
>
> > In shared sub issues, it's hard to pinpoint which consumer/where the problem lies, and to reset that one at the client. The totality of state spread between the brokers and all the consumers of the shared sub needs to be put together. Is that why we are doing this?
>
> From my experience, most are from Shared and Key_Shared subscriptions. Most of the issues come from misuse, rarely from bugs in brokers or clients.
>
> Regards,
> Penghui
>
> On Wed, Jan 18, 2023 at 11:31 AM Joe F <joefranc...@gmail.com> wrote:
>
> > Inclined to agree with Enrico. If it's a hard problem, it will repeat, and this is not helping. If it's some race on the client, it will occur randomly and rarely, and this unload sub will get programmed in as a way of life.
> >
> > > If you think unloading the subscription can't help anything, then unloading the topic should be the same. From my experience, most of the unloading topic operations are to mitigate the problems related to message consumption.
> >
> > Comparisons with unloading a topic are not the bar here, as that is a first class broker utility that is needed for operational reasons outside of "fixing" consumer side issues. The side effect of using "unload topic" is a loss of transient topic state. I will fully agree that this side-effect has been pervasively abused for fixing problems (a la Ctrl-Alt-Del), but that's not the rationale for having an unload topic utility.
> >
> > What kind of problems is this trying to fix?
> > And why cannot that be solved by client-side fixes?
> >
> > In shared sub issues, it's hard to pinpoint which consumer/where the problem lies, and to reset that one at the client.
> > The totality of state spread between the brokers and all the consumers of the shared sub needs to be put together. Is that why we are doing this?
> >
> > On Tue, Jan 17, 2023 at 5:30 PM PengHui Li <peng...@apache.org> wrote:
> >
> > > I agree that if we encounter a stuck consumption issue, we should continue to find the root cause of the problem.
> > >
> > > Subscription unloading is just an option to mitigate the impact first. Maybe it can mitigate the issue for 1 hour sometimes, especially with key_shared subscriptions. Sometimes it's not a bug in Pulsar, but users need time to fix the issue, and it doesn't make sense to let the impact continue until the fix is applied.
> > >
> > > I also helped many people to troubleshoot stuck consumption issues related to key_shared subscriptions, transactions, etc. In most cases, unloading the topic can mitigate the impact. For example, due to an uncaught exception, the dispatch thread stopped reading messages from the managed-ledger. The exception is a very infrequent occurrence. Unloading the topic is the best choice for now, right?
> > >
> > > If you think unloading the subscription can't help anything, then unloading the topic should be the same. From my experience, most of the unloading topic operations are to mitigate the problems related to message consumption.
> > >
> > > Best,
> > > Penghui
> > >
> > > On Tue, Jan 17, 2023 at 11:09 PM Enrico Olivelli <eolive...@gmail.com> wrote:
> > >
> > > > On Mon, Jan 16, 2023 at 11:58, r...@apache.org <ranxiaolong...@gmail.com> wrote:
> > > > >
> > > > > I agree with @Enrico @Bo: if we encounter a stuck subscription, we must continue to spend more time to locate and fix the problem, which is what we have been doing.
> > > > >
> > > > > But let's think about this problem from another angle. Suppose a user in the production environment encounters a stuck consumer right now: what should we do? For a production user, our first reaction to a problem is how to recover quickly and how to reduce the user's losses quickly. At that point we do not even think about whether this is a bug on the broker side, a bug on the SDK side, or a mistake in how the user uses the client. In the process of fast recovery, our most common method is to quickly re-establish the connection between the broker and the client by unloading the specified topic. In this process, we try to retain as much context as possible to help us keep diagnosing and repairing the problem afterwards.
> > > > >
> > > > > So I don't think these two things conflict. The reason we expose the unload topic admin CLI is the same reason we expect to expose unload subscription. From a developer's perspective, we definitely want to completely fix the problem that caused the stuck consumer. From the user's perspective, when something like a stuck consumer happens, the user does not care about the specific cause of the problem, but expects the business to recover in the shortest possible time to avoid further loss.
> > > > >
> > > > > I admit that this is a relatively hacky way, but it can indeed solve the problems we are currently encountering, and at the same time it will not cause a major conflict with Pulsar's existing logic. So I still agree with Yubiao's point of view.
> > > > >
> > > >
> > > > Usually when a subscription is "stuck", even if you unload the topic it returns to the "stuck" state again if you don't solve the problem.
> > > >
> > > > This is a very common issue with Pulsar users. I am spending much time helping users to troubleshoot their production problems, and unloading the topic is never a solution: it can give you seconds, minutes or hours of "working state", then the problem will happen again.
> > > >
> > > > You say that it can solve the problems you are encountering. Could you please give more context? (in Slack if this is not something that can be discussed in public)
> > > > I apologise if I seem too much of a skeptic this time. I am sure that you have a real problem and you want to fix it, but I would like to help you find the best way.
> > > >
> > > > Pulsar is used by many people and we shouldn't add hacky tools for temporary workarounds. Once we deliver an API we should maintain it for an unlimited time.
> > > >
> > > > You could patch your system and use the patched version temporarily until you find the root cause.
> > > >
> > > > Enrico
> > > >
> > > > > --
> > > > > Thanks
> > > > > Xiaolong Ran
> > > > >
> > > > > Yubiao Feng <yubiao.f...@streamnative.io.invalid> wrote on Sun, Jan 15, 2023 at 20:59:
> > > > >
> > > > > > Hi Qiang
> > > > > >
> > > > > > > 1. How do you handle the race condition when you are trying to unload the subscription, and the new consumer wants to subscribe to this subscription at the same time? I'm unsure if it has the race condition.
> > > > > > > I just want to remind you about that. :)
> > > > > >
> > > > > > The methods `addConsumer` and `removeConsumer` both hold synchronized locks; adding a synchronized lock when executing `reset subscription` as well can solve the problem.
> > > > > >
> > > > > > > 2. Would you like to add some restful API design to clarify the implementation?
> > > > > >
> > > > > > Already added the REST API design in the proposal
> > > > > > https://github.com/apache/pulsar/issues/19187
> > > > > >
> > > > > > On Thu, Jan 12, 2023 at 3:22 PM <mattisonc...@gmail.com> wrote:
> > > > > >
> > > > > > > Hi, Yubiao
> > > > > > >
> > > > > > > I agree with this idea because some users care about the production rate. They don't want to unload the whole topic to fix the subscription problem.
> > > > > > >
> > > > > > > I've got some questions:
> > > > > > >
> > > > > > > 1. How do you handle the race condition when you are trying to unload the subscription, and the new consumer wants to subscribe to this subscription at the same time? I'm unsure if it has the race condition. I just want to remind you about that. :)
> > > > > > > 2. Would you like to add some restful API design to clarify the implementation?
> > > > > > >    a. Request method
> > > > > > >    b. Request path
> > > > > > >    c. Response code
> > > > > > >    d. etc.
> > > > > > >
> > > > > > > Thanks for your work.
> > > > > > > Mattison
> > > > > > > On Jan 11, 2023, 17:01 +0800, Yubiao Feng <yubiao.f...@streamnative.io.invalid>, wrote:
> > > > > > > > Hi community
> > > > > > > >
> > > > > > > > I am starting a DISCUSS for PIP-240: A new API to unload subscriptions.
> > > > > > > >
> > > > > > > > PIP issue: https://github.com/apache/pulsar/issues/19187
> > > > > > > >
> > > > > > > > ### Motivation
> > > > > > > >
> > > > > > > > We sometimes try to unload the topic to resolve some consumption-stop issues. But unloading the topic will also impact the producer side.
> > > > > > > >
> > > > > > > > ### Goal
> > > > > > > >
> > > > > > > > Provide a new API that unloads at the subscription level: it triggers reconnection of all consumers on that subscription, and the reconnection is guaranteed by the client. The API will be used in these ways:
> > > > > > > > - unload a specific subscription of one topic (or partitioned topic)
> > > > > > > > - unload all subscriptions of one topic (or partitioned topic)
> > > > > > > > - unload subscriptions of one topic (or partitioned topic) by regular expression
> > > > > > > > - If a reader's subscription name is not set, a random subscription name prefixed with 'multiTopicsReader-' or 'reader-' will be used, and users can unload these subscriptions using regular expressions.
> > > > > > > >
> > > > > > > > In addition to triggering consumer disconnection, Unloading Subscribers will restart the Dispatcher, which resets the redeliver message queue and delayed message queue in the Broker's memory, which can help resolve issues caused by an abnormal dispatcher state. However, the execution flow of Unloading Subscribers does not include a restart of the Managed Cursor related to this dispatcher; if there is a problem with the cursor, we can only rely on the unload topic to solve it.
> > > > > > > >
> > > > > > > > Note: From the client's perspective, this connection may be shared by consumers, producers, and transactions, so Unloading Subscribers may impact the producer and transaction.
> > > > > > > >
> > > > > > > > #### These scenarios are not supported
> > > > > > > > - Functions `message-dedup`, `geo-replication`, and `shadow-topic` also read messages from the topic, but Unloading Subscribers will not support triggering restarts of these three functions (because the cursor is used directly to read the data in these scenarios, not the consumer or reader).
> > > > > > > > - The Compaction task (subscription name is `__compaction`) also uses a reader to read data, but Unloading Subscribers does not support it because this task creates a new reader each time it starts.
> > > > > > > > - Topics related to Transaction features are not supported:
> > > > > > > >   - `__transaction_buffer_snapshot` works with the task TB recover, and this task will create a new reader each time it starts.
> > > > > > > >   - `__transaction_pending_ack` works with the task Transaction Pending Ack Store replay, and this task will use the managed cursor directly to read data.
> > > > > > > >   - `__transaction_log_xxx` works with the task Transaction Log, which will use the managed cursor directly to read data.
> > > > > > > >   - `transaction_coordinator_assign`: no data will be written on this topic.
> > > > > > > >
> > > > > > > > #### Special system topic supports
> > > > > > > > The system topic `__change_events` is used to support topic-level policies. There may also be some message delivery issues in this scenario, so Unloading Subscribers will support this topic.
> > > > > > > >
> > > > > > > > ### API Changes
> > > > > > > >
> > > > > > > > #### For persistent topic
> > > > > > > > ```
> > > > > > > > pulsar-admin persistent unload {topic_name} -s {sub_name}
> > > > > > > > ```
> > > > > > > >
> > > > > > > > #### For non-persistent topic
> > > > > > > > ```
> > > > > > > > pulsar-admin non-persistent unload {topic_name} -s {sub_name}
> > > > > > > > ```
> > > > > > > >
> > > > > > > > #### Explain the param `-s`
> > > > > > > > - set param `-s` to a specific sub name to unload that subscription
> > > > > > > > - set param `-s` to `**` to unload all subscriptions under this topic
> > > > > > > > - set param `-s` to `regexp` to unload a batch of subscriptions under this topic
> > > > > > > >
> > > > > > > > Thanks
> > > > > > > > Yubiao Feng
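
To make the three `-s` modes from the proposal concrete, here is a minimal Java sketch of how a selector could be resolved against a topic's subscription names. `SubscriptionSelector` and `select` are hypothetical names for illustration, not Pulsar APIs. The sketch assumes that an exact name and a regular expression are both handled by one full regex match, which the proposal does not specify. `**` is special-cased because it is not a valid Java regular expression.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical helper for illustration only; not part of Pulsar. */
final class SubscriptionSelector {
    /** Selector value that the proposal reserves for "all subscriptions". */
    static final String ALL = "**";

    /**
     * Returns the subscription names chosen by the -s selector.
     * Assumption: any selector other than "**" is compiled as a regex and
     * must match the whole name, so a plain name selects only itself.
     */
    static List<String> select(List<String> subscriptionNames, String selector) {
        if (ALL.equals(selector)) {
            // "**" would throw PatternSyntaxException if compiled as a regex.
            return new ArrayList<>(subscriptionNames);
        }
        Pattern pattern = Pattern.compile(selector);
        List<String> matched = new ArrayList<>();
        for (String name : subscriptionNames) {
            if (pattern.matcher(name).matches()) {
                matched.add(name);
            }
        }
        return matched;
    }

    public static void main(String[] args) {
        // Made-up subscription names, including two auto-named readers.
        List<String> subs = List.of("orders-sub", "reader-a1b2", "multiTopicsReader-c3d4");
        System.out.println(select(subs, "orders-sub"));
        System.out.println(select(subs, "**"));
        System.out.println(select(subs, "(multiTopicsReader|reader)-.*"));
    }
}
```

One caveat of the full-match assumption: a subscription name that contains regex metacharacters such as `.` could match more names than itself, so a real implementation might check for an exact name before falling back to a regex.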