For 2/3 above, originally I was not thinking that we would have a different exception for INVALID_PRODUCER_EPOCH, and hence was thinking that in order to leverage KIP-360 for it we'd have to let the broker return UNKNOWN_PRODUCER_ID. I.e. we'd change the logic of the partition leader as well to return UNKNOWN_PRODUCER_ID, letting the producer try re-initializing the PID on the coordinator; if it is indeed due to fencing, the coordinator can let the client know of the fatal error and then fail. In that case we do need to bump up the produce API version so that the partition leader knows whether the request is from an older or newer client: for older clients that do not have KIP-360, we'd still return INVALID_PRODUCER_EPOCH; for newer clients, we can return UNKNOWN_PRODUCER_ID and let them find out the truth from the coordinator.
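To make that concrete, the version-gated mapping on the partition leader would have looked roughly like the sketch below; the class and method names and the version cut-off are made up for illustration, not taken from the broker code or the KIP.

    import org.apache.kafka.common.protocol.Errors;

    // Rough sketch of the version-gated mapping the original plan would have
    // required on the partition leader. The names and the version cut-off are
    // hypothetical, not the actual broker code.
    public class EpochFencingErrorSketch {

        // Hypothetical first ProduceRequest version whose client can follow the
        // KIP-360 style "re-initialize the PID on the coordinator" path.
        static final short FIRST_VERSION_WITH_EPOCH_BUMP = 9; // assumption

        static Errors onEpochFencing(short produceRequestVersion) {
            if (produceRequestVersion >= FIRST_VERSION_WITH_EPOCH_BUMP) {
                // Newer client: let it ask the coordinator for the truth; the
                // coordinator returns the fatal error if it really was fenced.
                return Errors.UNKNOWN_PRODUCER_ID;
            }
            // Older client without KIP-360: keep today's fatal error.
            return Errors.INVALID_PRODUCER_EPOCH;
        }
    }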
Now since we are additionally mapping INVALID_PRODUCER_EPOCH to a different error code and letting clients handle it similarly to UNKNOWN_PRODUCER_ID, this can indeed be saved. However, it also means that 2.5.0 clients would not benefit from this KIP, since they still treat INVALID_PRODUCER_EPOCH as fatal. If people think that's okay, then we can simplify the partition leader behavior and not bump up the produce API. I think I'm a bit inclined towards simplicity over benefiting older clients.

Guozhang

On Thu, Apr 16, 2020 at 5:37 PM Jason Gustafson <ja...@confluent.io> wrote:

> Hi Boyang,
>
> A few minor questions below:
>
> 1. You mention UNKNOWN_PRODUCER_ID in 2.a under Resilience Improvements. I
> assume that should be INVALID_PRODUCER_EPOCH? I am not sure this case makes
> sense for 2.5 clients, which would view this error as fatal regardless of
> whatever the broker does. Not sure there's anything we can do about that.
> Similarly, if a newer client is talking to a 2.5 broker, it wouldn't be
> able to bump the epoch after a timeout because the broker would not know to
> keep the last epoch. Unfortunately, I think the only improvements that are
> possible here are newer clients talking to newer brokers, but I might have
> missed something.
>
> 2. The proposal says that newer clients talking to older brokers should
> treat PRODUCER_FENCED as non-fatal. Just to clarify, older brokers will
> only return INVALID_PRODUCER_EPOCH because the PRODUCER_FENCED error did
> not exist. I think the logic should be something like this.
>
> On the Transaction Coordinator:
> - For old request versions, we continue returning INVALID_PRODUCER_EPOCH.
> Clients will translate this to PRODUCER_FENCED as they do today and treat
> this as a fatal error.
> - For new request versions, we return either PRODUCER_FENCED or
> TRANSACTION_TIMED_OUT depending on the case. Clients will treat the former
> as fatal and will bump the epoch on the latter.
>
> On the partition leader:
> - Nothing changes. Old clients will treat INVALID_PRODUCER_EPOCH as fatal.
> New clients will attempt to bump the epoch if the right version of
> InitProducerId is supported and fail otherwise.
>
> Does that seem right?
>
> 3. Why do we need the bump to the Produce API? As far as I can tell, we are
> not changing any errors that are used. The INVALID_PRODUCER_EPOCH error is
> already possible today and the two new errors cannot be returned from
> Produce.
>
> 4. The javadoc examples suggest the user should call `initTransactions()`
> in the case that a timeout is reached. I don't think that's right. For
> KIP-360, we bump the epoch internally. I think we'd want to do the same
> here so that we continue the current pattern where `initTransactions()` is
> used just once.
>
> Thanks,
> Jason
>
> On Thu, Apr 16, 2020 at 2:24 PM Guozhang Wang <wangg...@gmail.com> wrote:
>
> > +1 (binding), thanks!
> >
> > On Tue, Apr 14, 2020 at 4:36 PM Boyang Chen <reluctanthero...@gmail.com>
> > wrote:
> >
> > > Hey all,
> > >
> > > I would like to start the vote for KIP-588:
> > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-588%3A+Allow+producers+to+recover+gracefully+from+transaction+timeouts
> > >
> > > Feel free to continue posting on discussion thread if you have
> > > any questions, thanks!
> > >
> > > Best,
> > > Boyang
> > >
> >
> > --
> > -- Guozhang
>

--
-- Guozhang
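P.S. To make the coordinator-side mapping in Jason's point 2 concrete, it would be roughly the sketch below. This is illustrative only: the class and method names and the version cut-off are made up, and the local enum stands in for the error codes the KIP proposes to add.

    // Sketch of the coordinator-side mapping in point 2 above. Names and the
    // version cut-off are hypothetical; ProposedError stands in for the codes
    // KIP-588 proposes (PRODUCER_FENCED, TRANSACTION_TIMED_OUT) plus the
    // existing INVALID_PRODUCER_EPOCH.
    public class CoordinatorFencingSketch {

        enum ProposedError { INVALID_PRODUCER_EPOCH, PRODUCER_FENCED, TRANSACTION_TIMED_OUT }

        // Hypothetical first transactional request version that knows the new codes.
        static final short FIRST_VERSION_WITH_NEW_ERRORS = 2; // assumption

        static ProposedError onEpochMismatch(short requestVersion, boolean producerTimedOut) {
            if (requestVersion < FIRST_VERSION_WITH_NEW_ERRORS) {
                // Old clients translate this into a fatal ProducerFencedException, as today.
                return ProposedError.INVALID_PRODUCER_EPOCH;
            }
            // New clients: a real fencing stays fatal; a transaction timeout lets the
            // client bump its epoch internally (KIP-360 style) and continue.
            return producerTimedOut ? ProposedError.TRANSACTION_TIMED_OUT
                                    : ProposedError.PRODUCER_FENCED;
        }
    }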