Re: [Vote] KIP-588: Allow producers to recover gracefully from transaction timeouts

Boyang Chen Thu, 16 Apr 2020 20:00:13 -0700

Thanks Jason and Guozhang for the thoughts.

On Thu, Apr 16, 2020 at 6:09 PM Guozhang Wang <wangg...@gmail.com> wrote:


> For 2/3 above, originally I was not thinking that we will have a different
> exception for INVALID_PRODUCER_EPOCH and hence was thinking that in order
> to leverage KIP-360 for it, we'd have to let the broker to
> return UNKNOWN_PRODUCER_ID. I.e. we'd change the logic of partition leader
> as well to return UNKNOWN_PRODUCER_ID to let the producer try
> re-initializing the PID on the coordinator, and if it is indeed due to
> fencing, then coordinator can let the client know of the fatal error and
> then fail. In that case, then we do need to bump up the producer API
> version so that the partition leader knows if it is from older or newer
> clients: if it is older client who do not have KIP-360, we'd return
> INVALID_PRODUCER_EPOCH still; for newer client, we can
> return UNKNOWN_PRODUCER_ID to let it seek what's the truth on coordinator.
>
I know this is a bit of reversed order. Feel free to check out my reply to
Jason first and go back here :)
I think the partition leader will have no change of returned code after we
discussed that only new client should be able to retry.


> Now since we are additionally mapping INVALID_PRODUCER_EPOCH to a different
> error code now and letting clients to handle that similar
> to UNKNOWN_PRODUCER_ID, this can be saved indeed. However, it also means
> that 2.5.0 clients would not get benefited from this KIP, since they still
> treat INVALID_PRODUCER_EPOCH as fatal. If people this that's okay then we
> can simplify the partition leader behavior as well as not bumping up
> producer APIs. I think I'm a bit inclined towards simplicity over
> benefiting older clients.
>

2.5 client should get benefits if we return UNKNOWN_PRODUCER_ID as I have
described below.


> Guozhang
>
> On Thu, Apr 16, 2020 at 5:37 PM Jason Gustafson <ja...@confluent.io>
> wrote:
>
> > Hi Boyang,
> >
> > A few minor questions below:
> >
> > 1. You mention UNKNOWN_PRODUCER_ID in 2.a under Resilience Improvements.
> I
> > assume that should be INVALID_PRODUCER_EPOCH? I am not sure this case
> makes
> > sense for 2.5 clients which would view this error as fatal regardless of
> > whatever the broker does. Not sure there's anything we can do about that.
> > Similarly, if a newer client is talking to a 2.5 broker, it wouldn't be
> > able to bump the epoch after a timeout because the broker would not know
> to
> > keep the last epoch. Unfortunately, I think the only improvements that
> are
> > possible here are newer clients talking to newer brokers, but I might
> have
> > missed something.
> >
>
The reasoning for returning UNKNOWN_PRODUCER_ID is to trigger the retry
logic on 2.5 client hopefully.
As stated in the KIP, the benefited parties are *2.5 client and 2.6
client *only. If
transaction coordinator returns INVALID_PRODUCER_EPOCH for a transaction
timeout case, 2.5 client still has to treat as fatal as it only recognizes
PRODUCER_FENCED. Returning UNKNOWN_PRODUCER_ID for 2.5 client is a
substitute for TRANSACTION_TIMED_OUT error code, think of 9656
<https://issues.apache.org/jira/browse/KAFKA-9656> as a similar motivation.


> > 2. The proposal says that newer clients talking to older brokers should
> > treat PRODUCER_FENCED as non-fatal. Just to clarify, older brokers will
> > only return INVALID_PRODUCER_EPOCH because the PRODUCER_FENCED error did
> > not exist. I think the logic should be something like this.
> >
> > On the Transaction Coordinator:
> > - For old request versions, we continue returning INVALID_PRODUCER_EPOCH.
> > Clients will translate this to PRODUCER_FENCED as they do today and treat
> > this as a fatal error.
> > - For new request versions, we return either PRODUCER_FENCED or
> > TRANSACTION_TIMED_OUT depending on the case. Clients will treat the first
> > as fatal and will bump the epoch on the latter.
> >
> > On the partition leader:
> > - Nothing changes. Old clients will treat INVALID_PRODUCER_EPOCH as
> fatal.
> > New clients will attempt to bump the epoch if the right version of
> > InitProducerId is supported and fail otherwise.
> >
> > Does that seem right?
> >
>
As I stated in the previous answer, the txn coordinator could also choose
to return UNKNOWN_PRODUCER_ID for 2.5 clients or older
when there is a txn timeout. The rest looks good to me.


> > 3. Why do we need the bump to the Produce API? As far as I can tell, we
> are
> > not changing any errors that are used. The INVALID_PRODUCER_EPOCH error
> is
> > already possible today and the two new errors cannot be returned from
> > Produce.
> >
>
Guozhang had some thoughts about it. I think one vague point we haven't
clarified is what's the expected error code for server and client. The
relation should look like:
old client INVALID_PRODUCER_EPOCH -> 47 (ProducerFencedException)
new client INVALID_PRODUCER_EPOCH -> 47 (InvalidEpochException),
PRODUCER_FENCED -> 91 (ProducerFencedException)
old broker INVALID_PRODUCER_EPOCH -> 47 (ProducerFencedException)
new broker INVALID_PRODUCER_EPOCH -> 47 (InvalidEpochException),
PRODUCER_FENCED -> 91 (ProducerFencedException)

As we could see here, for a new partition leader, we shall return code 47
no matter what as stated in the KIP that PRODUCER_FENCED is not a valid
case reported from partition leader. So new client should always trigger
the retry logic for partition leader error. I guess the Produce API bump is
not necessary then.


> > 4. The javadoc examples suggest the user should call `initTransactions()`
> > in the case that a timeout is reached. I don't think that's right. For
> > KIP-360, we bump the epoch internally. I think we'd want to do the same
> > here so that we continue the current pattern where `initTransactions()`
> is
> > used just once.
> >
>
That's correct observation, will address it.


> > Thanks,
> > Jason
> >
> > On Thu, Apr 16, 2020 at 2:24 PM Guozhang Wang <wangg...@gmail.com>
> wrote:
> >
> > > +1 (binding), thanks!
> > >
> > > On Tue, Apr 14, 2020 at 4:36 PM Boyang Chen <
> reluctanthero...@gmail.com>
> > > wrote:
> > >
> > > > Hey all,
> > > >
> > > > I would like to start the vote for KIP-588:
> > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-588
> > > > %3A+Allow+producers+to+recover+gracefully+from+transaction+timeouts
> > > >
> > > > Feel free to continue posting on discussion thread if you have
> > > > any questions, thanks!
> > > >
> > > > Best,
> > > > Boyang
> > > >
> > >
> > >
> > > --
> > > -- Guozhang
> > >
> >
>
>
> --
> -- Guozhang
>

Re: [Vote] KIP-588: Allow producers to recover gracefully from transaction timeouts

Reply via email to