Hi Qiang,

We had a brainstorming session on this PIP over Zoom with Penghui, Hang,
and others, and I'm jotting down our feedback here.

Before I do that, I just want to write my own understanding of the
document, for other readers:

# Context
Pulsar, as opposed to other distributed / streaming systems, took the
approach of a push model. The client (the consumer, that is) asks the
broker for 1000 messages (that's the remaining capacity of the consumer's
internal queue); this request is called flow permits. The broker has now
been given permission to send 1000 messages to the client, and it uses the
TCP connection to push those messages as they become ready to be sent.
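
To make the flow-permit exchange concrete, here is a minimal Python sketch.
It is a toy model, not Pulsar code: the `Broker` and `Consumer` classes and
their method names are made up for illustration, and there is no networking
or tracking of outstanding permits.

```python
from collections import deque


class Broker:
    """Toy broker (hypothetical): pushes messages only while it holds permits."""

    def __init__(self, backlog):
        self.backlog = deque(backlog)
        self.permits = 0

    def flow(self, permits, consumer):
        # The consumer grants permission to push `permits` more messages.
        self.permits += permits
        self.dispatch(consumer)

    def dispatch(self, consumer):
        # Push while there are both permits left and messages ready.
        while self.permits > 0 and self.backlog:
            consumer.receive(self.backlog.popleft())
            self.permits -= 1


class Consumer:
    """Toy consumer (hypothetical) with a bounded internal queue."""

    def __init__(self, queue_size):
        self.queue_size = queue_size
        self.queue = deque()

    def request_more(self, broker):
        # Ask for exactly the free capacity of the internal queue.
        broker.flow(self.queue_size - len(self.queue), self)

    def receive(self, message):
        self.queue.append(message)


broker = Broker(backlog=range(5))
consumer = Consumer(queue_size=3)
consumer.request_more(broker)
print(list(consumer.queue))  # [0, 1, 2]
```

The broker stops after three messages even though two more are ready,
because it has used up the permits it was given.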

The consumer can ask the subscription to reset its position to a new,
requested position (a seek).
The problem we have today is that when a consumer sends a request to reset
the subscription position, the broker:
1. Closes the TCP connection, which in turn causes the client to clear any
pending messages it has in its queue.
2. Continues to send messages from the previous position, up to a certain
point where the broker "shifts gear" and starts sending messages from the
new position.

So the problem is that you would expect that, after the connection is
reset, only messages from the new position are sent to the consumer, but
that is not what happens.

We have to keep in mind that there are effectively two scenarios here from
the consumer's point of view:
1. Single consumer - Either because the subscription is Exclusive, or
because the subscription is of type Failover, where a single consumer
receives the messages of a topic.
2. Multiple consumers - With Shared or Key Shared subscription types. In
this case, one of those consumers can decide to reset the position of the
*subscription*. When that happens, the broker again resets the TCP
connections of all existing consumers upon receiving the seek command, and
you would expect any messages sent afterward to be from the new position,
which again doesn't happen.

Another really important piece of context for the reader is the notion of
an epoch. The epoch was first introduced to Pulsar in PIP-84. The idea is
that every time the client starts a "session" of requesting and receiving
messages in response, the client sends a Session Sequence Number, and the
server responds to those message requests with the same session sequence
number. Since Pulsar doesn't follow a request-response model but has a
bi-directional protocol, the client can send a command to fetch messages
using a new session sequence number while the server is still sending
messages using the old one. Using the Session Sequence Number, the client
can tell apart the messages being pushed to it by the server: those that
belong to the current session and those that belong to an older one. That
Session Sequence Number is what is referred to as the epoch in PIP-84 and
also in this PIP.
The idea was to demarcate the responses coming from the server according
to the commands the client sends, as the two are *independent* (async).
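
A rough Python sketch of how a client could use the epoch to discard
messages from an old session. The `EpochConsumer` class and its methods are
hypothetical names for illustration only, and the assumption that the epoch
is bumped on seek is mine, not a description of the actual client code.

```python
class EpochConsumer:
    """Toy consumer (hypothetical) that drops messages from older epochs."""

    def __init__(self):
        self.epoch = 0
        self.queue = []

    def seek(self):
        # Assumption: a seek starts a new "session". Bump the epoch and
        # drop whatever was prefetched under the old one.
        self.epoch += 1
        self.queue.clear()
        return self.epoch  # would be sent to the broker with the command

    def on_message(self, payload, epoch):
        # A message stamped with an older epoch was pushed before the
        # broker processed the latest command, so it is discarded.
        if epoch < self.epoch:
            return False
        self.queue.append(payload)
        return True


consumer = EpochConsumer()
consumer.on_message("m1", epoch=0)
new_epoch = consumer.seek()
consumer.on_message("m2", epoch=0)           # stale, was still in flight
consumer.on_message("m10", epoch=new_epoch)  # belongs to the new session
print(consumer.queue)  # ['m10']
```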

# What are the issues with this PIP?
1. The PIP solves the problem listed above *only* for Exclusive and
Failover subscriptions, where you have only a single consumer. The problem
remains unsolved for Shared and Key Shared subscriptions.
2. The cost of solving a small portion of the problem is high: it adds
complexity, in the form of another field in the protocol and another thing
to check. I believe we should aim to reduce the cognitive load on the
developers of Pulsar.
3. There are no rejected solutions - We always need to examine all
available options and list why we decided against them.
4. Lack of background knowledge (context) - it's super hard IMO to grasp
the idea with so much context missing: the client-server protocol
pertaining to this PIP, including its async nature, what an epoch is and
why it was introduced, and what flow permits are. I'm not saying the doc
should explain all of Pulsar, just that it should include a brief
explanation of that terminology.

# What We Suggest

Rethink the solution:
1. The consumer (one of many) sends a seek command to the broker, and at
the same time clears its internal queue and waits for a response from the
broker.
2. The broker, upon receiving the seek command, will:
     a. Stop dispatching messages to consumers.
     b. Notify all consumers via a (new) command that the subscription
position was asked to be reset. Consumers receiving this command will clear
their internal queue. The broker will no longer close the TCP connection
(with its adverse effects on other consumers and producers "riding" on that
connection).
     c. Reset the cursor to the newly requested position.
     d. Resume dispatching messages to consumers from the newly requested
position.

The disadvantage here is that we need to alter the client to recognize a
new command and act accordingly, yet I think that is accidental complexity
stemming from the client-server protocol being bi-directional rather than
request-response.
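
The seek flow suggested above (steps 1 and 2a-2d) can be sketched roughly
in Python as follows. This is a single-threaded toy under my own
assumptions: `Broker`, `Consumer`, `on_position_reset` and `handle_seek`
are hypothetical names, the round-robin dispatch is only a stand-in for a
Shared subscription, and the `paused` flag merely marks where a real broker
would have to hold back its dispatcher.

```python
class Consumer:
    """Toy consumer (hypothetical) that stays connected across a seek."""

    def __init__(self):
        self.queue = []

    def on_message(self, message):
        self.queue.append(message)

    def on_position_reset(self):
        # Hypothetical new command: clear prefetched messages, keep the
        # connection open.
        self.queue.clear()

    def seek(self, broker, position):
        # Step 1: clear the local queue and send the seek to the broker.
        self.queue.clear()
        broker.handle_seek(position)


class Broker:
    """Toy broker (hypothetical) for one subscription with many consumers."""

    def __init__(self, log, consumers):
        self.log = log
        self.consumers = consumers
        self.cursor = 0
        self.paused = False

    def dispatch(self, count):
        # Round-robin stand-in for dispatching on a Shared subscription.
        if self.paused:
            return
        for _ in range(count):
            if self.cursor >= len(self.log):
                break
            target = self.consumers[self.cursor % len(self.consumers)]
            target.on_message(self.log[self.cursor])
            self.cursor += 1

    def handle_seek(self, position):
        self.paused = True                 # 2a. stop dispatching
        for consumer in self.consumers:    # 2b. notify all consumers
            consumer.on_position_reset()
        self.cursor = position             # 2c. reset the cursor
        self.paused = False                # 2d. allow dispatching again


log = [f"m{i}" for i in range(10)]
a, b = Consumer(), Consumer()
broker = Broker(log, [a, b])
broker.dispatch(4)   # a holds m0, m2 and b holds m1, m3
a.seek(broker, 7)    # both queues are cleared, no connection is closed
broker.dispatch(3)
print(a.queue, b.queue)  # ['m8'] ['m7', 'm9']
```

After the seek, neither consumer holds a message from before position 7,
including consumer `b`, which never asked for the reset.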

Thanks,

Asaf

On Mon, Aug 1, 2022 at 6:43 AM Qiang Huang <qiang.huang1...@gmail.com>
wrote:

> Sure. You can refer to pip-84:
>
> https://github.com/apache/pulsar/wiki/PIP-84-:-Pulsar-client:-Redeliver-command-add-epoch
> .
>
> On Fri, Jul 29, 2022 at 10:22, Zike Yang <z...@apache.org> wrote:
>
> > Hi, Qiang
> >
> > > It is necessary to check the current cursor status when handling
> > > flowPermits request from the server side. If the server is handling
> > > seek request, it should ignore flowPermits request because the
> > > request is illegal.
> >
> > Thanks for your explanation. I think it's better to add this
> > explanation to the PIP.
> >
> > > The reconnected consumer can regard as a new consumer with new epoch.
> >
> > The consumer will reconnect to the broker during the seek operation.
> > And this will change the existing behavior. It doesn't seem to make
> > sense. Please correct me if I have misunderstood.
> >
> > Thanks,
> > Zike Yang
> >
> > On Wed, Jul 27, 2022 at 8:06 PM Qiang Huang <qiang.huang1...@gmail.com>
> > wrote:
> > >
> > > Thanks Zike.
> > > > > - stage 1: Check the current cursor status when handling
> > > > > flowPermits from the server side.
> > >
> > > > > Could you explain more details on this step? It looks like there is
> > > not much described above. What kind of status needs to be checked, and
> > > what kind of behavior will the broker take?
> > > It is necessary to check the current cursor status when handling
> > > flowPermits request from the server side. If the server is handling
> > > seek request, it should ignore flowPermits request because the
> > > request is illegal.
> > >
> > >
> > > > > 1. Consumer reconnect need reset epoch.
> > > >> Why do we need to reset the epoch when the consumer reconnects?
> > > The reconnected consumer can regard as a new consumer with new epoch.
> >
>
>
> --
> BR,
> Qiang Huang
>
