Thanks Jason and Guozhang for the thoughts. On Thu, Apr 16, 2020 at 6:09 PM Guozhang Wang <wangg...@gmail.com> wrote:
> For 2/3 above, originally I was not thinking that we will have a different > exception for INVALID_PRODUCER_EPOCH and hence was thinking that in order > to leverage KIP-360 for it, we'd have to let the broker to > return UNKNOWN_PRODUCER_ID. I.e. we'd change the logic of partition leader > as well to return UNKNOWN_PRODUCER_ID to let the producer try > re-initializing the PID on the coordinator, and if it is indeed due to > fencing, then coordinator can let the client know of the fatal error and > then fail. In that case, then we do need to bump up the producer API > version so that the partition leader knows if it is from older or newer > clients: if it is older client who do not have KIP-360, we'd return > INVALID_PRODUCER_EPOCH still; for newer client, we can > return UNKNOWN_PRODUCER_ID to let it seek what's the truth on coordinator. > I know this is a bit of reversed order. Feel free to check out my reply to Jason first and go back here :) I think the partition leader will have no change of returned code after we discussed that only new client should be able to retry. > Now since we are additionally mapping INVALID_PRODUCER_EPOCH to a different > error code now and letting clients to handle that similar > to UNKNOWN_PRODUCER_ID, this can be saved indeed. However, it also means > that 2.5.0 clients would not get benefited from this KIP, since they still > treat INVALID_PRODUCER_EPOCH as fatal. If people this that's okay then we > can simplify the partition leader behavior as well as not bumping up > producer APIs. I think I'm a bit inclined towards simplicity over > benefiting older clients. > 2.5 client should get benefits if we return UNKNOWN_PRODUCER_ID as I have described below. > Guozhang > > On Thu, Apr 16, 2020 at 5:37 PM Jason Gustafson <ja...@confluent.io> > wrote: > > > Hi Boyang, > > > > A few minor questions below: > > > > 1. You mention UNKNOWN_PRODUCER_ID in 2.a under Resilience Improvements. > I > > assume that should be INVALID_PRODUCER_EPOCH? I am not sure this case > makes > > sense for 2.5 clients which would view this error as fatal regardless of > > whatever the broker does. Not sure there's anything we can do about that. > > Similarly, if a newer client is talking to a 2.5 broker, it wouldn't be > > able to bump the epoch after a timeout because the broker would not know > to > > keep the last epoch. Unfortunately, I think the only improvements that > are > > possible here are newer clients talking to newer brokers, but I might > have > > missed something. > > > The reasoning for returning UNKNOWN_PRODUCER_ID is to trigger the retry logic on 2.5 client hopefully. As stated in the KIP, the benefited parties are *2.5 client and 2.6 client *only. If transaction coordinator returns INVALID_PRODUCER_EPOCH for a transaction timeout case, 2.5 client still has to treat as fatal as it only recognizes PRODUCER_FENCED. Returning UNKNOWN_PRODUCER_ID for 2.5 client is a substitute for TRANSACTION_TIMED_OUT error code, think of 9656 <https://issues.apache.org/jira/browse/KAFKA-9656> as a similar motivation. > > 2. The proposal says that newer clients talking to older brokers should > > treat PRODUCER_FENCED as non-fatal. Just to clarify, older brokers will > > only return INVALID_PRODUCER_EPOCH because the PRODUCER_FENCED error did > > not exist. I think the logic should be something like this. > > > > On the Transaction Coordinator: > > - For old request versions, we continue returning INVALID_PRODUCER_EPOCH. > > Clients will translate this to PRODUCER_FENCED as they do today and treat > > this as a fatal error. > > - For new request versions, we return either PRODUCER_FENCED or > > TRANSACTION_TIMED_OUT depending on the case. Clients will treat the first > > as fatal and will bump the epoch on the latter. > > > > On the partition leader: > > - Nothing changes. Old clients will treat INVALID_PRODUCER_EPOCH as > fatal. > > New clients will attempt to bump the epoch if the right version of > > InitProducerId is supported and fail otherwise. > > > > Does that seem right? > > > As I stated in the previous answer, the txn coordinator could also choose to return UNKNOWN_PRODUCER_ID for 2.5 clients or older when there is a txn timeout. The rest looks good to me. > > 3. Why do we need the bump to the Produce API? As far as I can tell, we > are > > not changing any errors that are used. The INVALID_PRODUCER_EPOCH error > is > > already possible today and the two new errors cannot be returned from > > Produce. > > > Guozhang had some thoughts about it. I think one vague point we haven't clarified is what's the expected error code for server and client. The relation should look like: old client INVALID_PRODUCER_EPOCH -> 47 (ProducerFencedException) new client INVALID_PRODUCER_EPOCH -> 47 (InvalidEpochException), PRODUCER_FENCED -> 91 (ProducerFencedException) old broker INVALID_PRODUCER_EPOCH -> 47 (ProducerFencedException) new broker INVALID_PRODUCER_EPOCH -> 47 (InvalidEpochException), PRODUCER_FENCED -> 91 (ProducerFencedException) As we could see here, for a new partition leader, we shall return code 47 no matter what as stated in the KIP that PRODUCER_FENCED is not a valid case reported from partition leader. So new client should always trigger the retry logic for partition leader error. I guess the Produce API bump is not necessary then. > > 4. The javadoc examples suggest the user should call `initTransactions()` > > in the case that a timeout is reached. I don't think that's right. For > > KIP-360, we bump the epoch internally. I think we'd want to do the same > > here so that we continue the current pattern where `initTransactions()` > is > > used just once. > > > That's correct observation, will address it. > > Thanks, > > Jason > > > > On Thu, Apr 16, 2020 at 2:24 PM Guozhang Wang <wangg...@gmail.com> > wrote: > > > > > +1 (binding), thanks! > > > > > > On Tue, Apr 14, 2020 at 4:36 PM Boyang Chen < > reluctanthero...@gmail.com> > > > wrote: > > > > > > > Hey all, > > > > > > > > I would like to start the vote for KIP-588: > > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-588 > > > > %3A+Allow+producers+to+recover+gracefully+from+transaction+timeouts > > > > > > > > Feel free to continue posting on discussion thread if you have > > > > any questions, thanks! > > > > > > > > Best, > > > > Boyang > > > > > > > > > > > > > -- > > > -- Guozhang > > > > > > > > -- > -- Guozhang >