I started the voting process for this PIP.

Thanks
Yubiao

On Thu, Jan 19, 2023 at 5:55 PM Haiting Jiang <jianghait...@gmail.com> wrote:

> I agree with Penghui & Xiaolong,
>
> 1. Restarting a service is usually the most common and effective option for service maintainers to recover a service and minimize business loss. With subscription unloading we can reduce the impact significantly, because unloading a topic also affects message writing, which matters much more for online business.
>
> 2. Having subscription unloading doesn't conflict with solving the real issue. Like restarting a broker, it just buys us more time to locate the real problem.
>
> BR,
> Haiting
>
> On Thu, Jan 19, 2023 at 11:42 AM r...@apache.org <ranxiaolong...@gmail.com> wrote:
> >
> > Hello Joe and Enrico:
> >
> > I agree with what you have been emphasizing: we need to fix these issues at the root cause. While maintaining the Go SDK, we have encountered many stuck-consumer problems since version 0.4.0. Some were logic errors in the Go SDK itself, and some were caused by users misusing the Go SDK. By the previous 0.8.0 version, the Go SDK was in large-scale use in our environment. Across these versions we have been trying to fix these bugs completely. This is what our maintainers have been working hard on, and it is also the final state we expect of Pulsar: everything looks OK.
> >
> > However, during the iteration of the Go SDK from 0.4.0 to 0.8.0, users of our production environment encountered similar problems many times. Consider a user in a production environment whose consumption is blocked. The user comes to us. Do we use some means to quickly let the consumers continue to consume messages?
> > Or do we keep users in the production environment in a stuck state until we find the root cause of the problem and fix it, pushing users to upgrade? I think everyone's answer tends to be the former. We will not expose the hack operations of unload topic and unload sub directly to users, but to Pulsar's operation and maintenance personnel, so it is more like an operation and maintenance tool than an interface called by the user. So I think the impact is controllable for Pulsar as a whole, which is why I support it.
> >
> > Again, this PIP is about buying more time for us to locate the problem while minimizing the impact on production users. It's not that with this interface we stop locating the real causes of the stuck state. On the contrary, we are making a trade-off between serving users and locating the problem, which buys us more time to locate it.
> >
> > --
> > Thanks
> > xiaolong ran
> >
> > On Wed, Jan 18, 2023 at 11:48, PengHui Li <peng...@apache.org> wrote:
> >
> > > > What kind of problems is this trying to fix?
> > > > And why cannot that be solved by client-side fixes?
> > >
> > > Yes, most of the issues are from the client side, rarely from the broker. But the application also needs time to fix the issue, release the fix, and deploy it to the production environment. Unloading the subscription is just a temporary way to mitigate the issue and reduce the impact. It will not fix the issue completely.
> > >
> > > What I learned is to capture the heap dump, topic stats, internal stats, and logs from the broker and client, and then try to unload the topic to see if the problem is mitigated. If not, then try to restart the broker or client. Most of the time, the problem can be mitigated in this way. Then we can continue to reproduce the issue and investigate it from the captured heap dump and logs.
> > >
> > > > In shared sub issues, it's hard to pinpoint which consumer/where the problem lies, and to reset that one at the client. The totality of state spread between the brokers and all the consumers of the shared sub needs to be put together. Is that why we are doing this?
> > >
> > > From my experience, most are from Shared and Key_Shared subscriptions. Most of the issues come from misuse, rarely from bugs in brokers or clients.
> > >
> > > Regards,
> > > Penghui
> > >
> > > On Wed, Jan 18, 2023 at 11:31 AM Joe F <joefranc...@gmail.com> wrote:
> > >
> > > > Inclined to agree with Enrico. If it's a hard problem, it will repeat, and this is not helping. If it's some race on the client, it will occur randomly and rarely, and this unload sub will get programmed in as a way of life.
> > > >
> > > > > If you think unloading the subscription can't help anything, unloading the topic should be the same. From my experience, most of the unloading topic operations are to mitigate the problems related to message consumption.
> > > >
> > > > Comparisons with unloading a topic are not the bar here, as that is a first-class broker utility that is needed for operational reasons outside of "fixing" consumer-side issues. The side effect of using "unload topic" is a loss of transient topic state. I will fully agree that this side effect has been pervasively abused for fixing problems (a la Ctrl-Alt-Del), but that's not the rationale for having an unload topic utility.
> > > >
> > > > What kind of problems is this trying to fix?
> > > > And why cannot that be solved by client-side fixes?
> > > >
> > > > In shared sub issues, it's hard to pinpoint which consumer/where the problem lies, and to reset that one at the client.
> > > > The totality of state spread between the brokers and all the consumers of the shared sub needs to be put together. Is that why we are doing this?
> > > >
> > > > On Tue, Jan 17, 2023 at 5:30 PM PengHui Li <peng...@apache.org> wrote:
> > > >
> > > > > I agree that if we encounter a stuck consumption issue, we should continue to find the root cause of the problem.
> > > > >
> > > > > Subscription unloading is just an option to mitigate the impact first. Sometimes it may mitigate the issue for only an hour, especially with key_shared subscriptions. Sometimes it's not a bug in Pulsar, but users need time to fix the issue, and it doesn't make sense to let the impact continue until the fix is applied.
> > > > >
> > > > > I have also helped many people troubleshoot stuck consumption issues related to key_shared subscriptions, transactions, etc. In most cases, unloading the topic can mitigate the impact. For example, due to an uncaught exception, the dispatch thread stopped reading messages from the managed ledger. Such an exception occurs very infrequently. Unloading the topic is the best choice for now, right?
> > > > >
> > > > > If you think unloading the subscription can't help anything, unloading the topic should be the same. From my experience, most of the unloading topic operations are to mitigate the problems related to message consumption.
> > > > >
> > > > > Best,
> > > > > Penghui
> > > > >
> > > > > On Tue, Jan 17, 2023 at 11:09 PM Enrico Olivelli <eolive...@gmail.com> wrote:
> > > > >
> > > > > > On Mon, Jan 16, 2023 at 11:58, r...@apache.org <ranxiaolong...@gmail.com> wrote:
> > > > > > >
> > > > > > > I agree with @Enrico @Bo: if we encounter a stuck subscription, we must continue to spend more time locating and fixing the problem, which is what we have been doing.
> > > > > > >
> > > > > > > But let's think about this problem from another angle. Suppose a user in the production environment encounters a stuck consumer right now. What should we do? For a user in a production environment, our first reaction to a problem is how to recover quickly and how to reduce the user's losses quickly. At that point we don't even think about whether it is a bug on the broker side, a bug on the SDK side, or a mistake in the user's own usage. In the process of fast recovery, our most common method is to quickly re-establish the connection between the broker and the client by unloading the specified topic. In this process, we try to retain as much context as possible to help us keep locating and repairing the problem afterwards.
> > > > > > >
> > > > > > > So I don't think these two things conflict. The reason we expose the unload-topic admin CLI is the same reason we want to expose unload-subscription.
> > > > > > > If we look at this from a developer's perspective, we definitely want to completely fix the problem that caused the stuck state. From the user's perspective, when something like a stuck consumer occurs, the user does not care about the specific cause of the problem, but expects the business to recover in the shortest possible time to avoid further loss.
> > > > > > >
> > > > > > > I admit that this is a relatively hacky way, but it can indeed solve the problems we are currently encountering, and at the same time it will not cause a major conflict with Pulsar's existing logic. So I still agree with yubiao's point of view.
> > > > > >
> > > > > > Usually when a subscription is "stuck", even if you unload the topic it returns to the "stuck" state again if you don't solve the problem.
> > > > > >
> > > > > > This is a very common issue with Pulsar users. I am spending much time helping users to troubleshoot their production problems, and unloading the topic is never a solution. It can give you seconds, minutes or hours of "working state", then the problem will happen again.
> > > > > >
> > > > > > You say that it can solve the problems you are encountering. Could you please give more context? (In Slack if this is not something that can be discussed in public.)
> > > > > > I apologise if I seem too much of a skeptic this time. I am sure that you have a real problem and you want to fix it, but I would like to help you find the best way.
> > > > > >
> > > > > > Pulsar is used by many people and we shouldn't add hacky tools for temporary workarounds. Once we deliver an API we should maintain it for an unlimited time.
> > > > > >
> > > > > > You could patch your system and use the patched version temporarily until you find the root cause.
> > > > > >
> > > > > > Enrico
> > > > > >
> > > > > > > --
> > > > > > > Thanks
> > > > > > > Xiaolong Ran
> > > > > > >
> > > > > > > On Sun, Jan 15, 2023 at 20:59, Yubiao Feng <yubiao.f...@streamnative.io.invalid> wrote:
> > > > > > >
> > > > > > > > Hi Qiang
> > > > > > > >
> > > > > > > > > 1. How do you handle the race condition when you are trying to unload the subscription, and the new consumer wants to subscribe to this subscription at the same time? I'm unsure if it has the race condition. I just want to remind you about that. :)
> > > > > > > >
> > > > > > > > The methods `addConsumer` and `removeConsumer` both hold synchronized locks; adding a synchronized lock when executing `reset subscription` as well solves the problem.
> > > > > > > >
> > > > > > > > > 2. Would you like to add some restful API design to clarify the implementation?
> > > > > > > >
> > > > > > > > Already added the REST API design in the proposal:
> > > > > > > > https://github.com/apache/pulsar/issues/19187
> > > > > > > >
> > > > > > > > On Thu, Jan 12, 2023 at 3:22 PM <mattisonc...@gmail.com> wrote:
> > > > > > > >
> > > > > > > > > Hi, Yubiao
> > > > > > > > >
> > > > > > > > > I agree with this idea because some users care about the production rate. They don't want to unload the whole topic to fix the subscription problem.
> > > > > > > > >
> > > > > > > > > I've got some questions:
> > > > > > > > >
> > > > > > > > > 1. How do you handle the race condition when you are trying to unload the subscription, and the new consumer wants to subscribe to this subscription at the same time? I'm unsure if it has the race condition. I just want to remind you about that. :)
> > > > > > > > > 2. Would you like to add some restful API design to clarify the implementation?
> > > > > > > > >    a. Request method
> > > > > > > > >    b. Request path
> > > > > > > > >    c. Response code
> > > > > > > > >    d. etc.
> > > > > > > > >
> > > > > > > > > Thanks for your work.
> > > > > > > > > Mattison
> > > > > > > > > On Jan 11, 2023, 17:01 +0800, Yubiao Feng <yubiao.f...@streamnative.io.invalid>, wrote:
> > > > > > > > > > Hi community
> > > > > > > > > >
> > > > > > > > > > I am starting a DISCUSS for PIP-240: A new API to unload subscriptions.
> > > > > > > > > >
> > > > > > > > > > PIP issue: https://github.com/apache/pulsar/issues/19187
> > > > > > > > > >
> > > > > > > > > > ### Motivation
> > > > > > > > > >
> > > > > > > > > > We sometimes try to unload the topic to resolve some consumption-stop issues. But unloading the topic will also impact the producer side.
> > > > > > > > > >
> > > > > > > > > > ### Goal
> > > > > > > > > >
> > > > > > > > > > Provide a new API to unload at the subscription level. It triggers reconnection of all consumers on that subscription; the reconnection is guaranteed by the client.
> > > > > > > > > > The API will be used in these ways:
> > > > > > > > > > - unload a specific subscription of one topic (or partitioned topic)
> > > > > > > > > > - unload all subscriptions of one topic (or partitioned topic)
> > > > > > > > > > - unload subscriptions of one topic (or partitioned topic) by regular expression
> > > > > > > > > > - If a reader's subscription name is not set, a random subscription name prefixed with 'multiTopicsReader-' or 'reader-' will be used, and users can unload these subscriptions using regular expressions.
> > > > > > > > > >
> > > > > > > > > > In addition to triggering consumer disconnection, Unloading Subscribers will restart the Dispatcher, which resets the redeliver message queue and delayed message queue in the Broker's memory. This can help resolve issues caused by an abnormal dispatcher state. However, the execution flow of Unloading Subscribers does not include a restart of the Managed Cursor related to this dispatcher; if there is a problem with the cursor, we can only rely on unloading the topic to solve it.
> > > > > > > > > >
> > > > > > > > > > Note: From the client's perspective, this connection may be shared by consumers, producers, and transactions, so Unloading Subscribers may impact the producer and transaction.
> > > > > > > > > >
> > > > > > > > > > #### These scenarios are not supported
> > > > > > > > > > - The functions `message-dedup`, `geo-replication`, and `shadow-topic` also read messages from the topic, but Unloading Subscribers will not support triggering restarts of these three functions (because in these scenarios the cursor is used directly to read the data, not a consumer or reader).
> > > > > > > > > > - The Compaction task (subscription name is `__compaction`) also uses a reader to read data, but Unloading Subscribers does not support it because this task creates a new reader each time it starts.
> > > > > > > > > > - Topics related to Transaction features are not supported:
> > > > > > > > > >   - `__transaction_buffer_snapshot` works with the TB recover task, which creates a new reader each time it starts.
> > > > > > > > > >   - `__transaction_pending_ack` works with the Transaction Pending Ack Store replay task, which uses the managed cursor directly to read data.
> > > > > > > > > >   - `__transaction_log_xxx` works with the Transaction Log task, which uses the managed cursor directly to read data.
> > > > > > > > > >   - `transaction_coordinator_assign`: no data is written to this topic.
> > > > > > > > > >
> > > > > > > > > > #### Special system topic supports
> > > > > > > > > > The system topic `__change_events` is used to support topic-level policies. There may also be some message delivery issues in this scenario, so Unloading Subscribers will support this topic.
> > > > > > > > > >
> > > > > > > > > > ### API Changes
> > > > > > > > > >
> > > > > > > > > > #### For persistent topic
> > > > > > > > > > ```
> > > > > > > > > > pulsar-admin persistent unload {topic_name} -s {sub_name}
> > > > > > > > > > ```
> > > > > > > > > >
> > > > > > > > > > #### For non-persistent topic
> > > > > > > > > > ```
> > > > > > > > > > pulsar-admin non-persistent unload {topic_name} -s {sub_name}
> > > > > > > > > > ```
> > > > > > > > > >
> > > > > > > > > > #### Explain the param `-s`
> > > > > > > > > > - set `-s` to a specific subscription name to unload that subscription
> > > > > > > > > > - set `-s` to `**` to unload all subscriptions under this topic
> > > > > > > > > > - set `-s` to a regular expression to unload a batch of subscriptions under this topic
> > > > > > > > > >
> > > > > > > > > > Thanks
> > > > > > > > > > Yubiao Feng
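
To make the three `-s` cases in the proposal concrete, here is a minimal sketch of how a selector value could be resolved against the subscriptions that exist on a topic. It is only an illustration: the helper name `select_subscriptions`, the sample subscription names, and the matching rules (literal name first, then whole-name regular expression match) are assumptions, not the broker's or `pulsar-admin`'s actual behaviour, which the thread does not specify.

```python
import re


def select_subscriptions(existing, selector):
    """Return the subscription names that a given `-s` value would select.

    Hypothetical helper: the rules below are assumptions made for
    illustration and are not taken from the Pulsar code base.
    """
    # `**` selects every subscription on the topic. It has to be handled
    # before regex compilation because `**` is not a valid regular expression.
    if selector == "**":
        return sorted(existing)
    # A literal name that exists selects exactly that subscription.
    if selector in existing:
        return [selector]
    # Otherwise treat the value as a regular expression that must match
    # the whole subscription name.
    pattern = re.compile(selector)
    return sorted(name for name in existing if pattern.fullmatch(name))


# Made-up subscription names, including the auto-generated reader prefixes
# mentioned in the proposal.
subs = {"orders-sub", "reader-a1b2c3", "reader-d4e5f6", "multiTopicsReader-0f9e"}

print(select_subscriptions(subs, "orders-sub"))
print(select_subscriptions(subs, "(multiTopicsReader|reader)-.*"))
print(select_subscriptions(subs, "**"))
```

Matching against the whole name rather than a substring is a deliberate choice in this sketch: with a substring match, a selector such as `reader` would also pick up any unrelated subscription whose name merely contains that word.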