Thanks a lot John for giving a review on Friday night! And thank you Guozhang, and Jason for your votes as well :)
Now that we have collected 3 binding votes (Guozhang, Jason, John), I will close the voting thread and mark the KIP as approved. Still feel free to raise any question on the mailing list for sure! Best, Boyang On Fri, Apr 24, 2020 at 8:29 PM John Roesler <vvcep...@apache.org> wrote: > Hi Boyang, > > Thanks for the KIP! I've just read it over and caught up on all the prior > discussions. > The current version of the KIP looks good to me, and I think the decisions > you've > made are reasonable. > > I'm +1 (binding) > > Thanks, > -John > > On Wed, Apr 22, 2020, at 12:12, Boyang Chen wrote: > > Hey Jason, > > > > thanks for the suggestions! Addressed in the KIP. > > > > On Wed, Apr 22, 2020 at 9:21 AM Jason Gustafson <ja...@confluent.io> > wrote: > > > > > +1 Just a couple small comments: > > > > > > 1. My comment about `initTransactions()` usage in the javadoc above > appears > > > not to have been addressed. > > > 2. For the handling of INVALID_PRODUCER_EPOCH in the produce response, > > > would we only try to abort if the broker supports the newer protocol > > > version? I guess it would be simpler in the implementation if the > producer > > > did it in any case even if it might not be useful for older versions. > > > > > > -Jason > > > > > > On Fri, Apr 17, 2020 at 3:55 PM Guozhang Wang <wangg...@gmail.com> > wrote: > > > > > > > Sounds good to me. Thanks Boyang. > > > > > > > > On Fri, Apr 17, 2020 at 3:32 PM Boyang Chen < > reluctanthero...@gmail.com> > > > > wrote: > > > > > > > > > Thanks Guozhang, > > > > > > > > > > I think most of the complexity comes from our intention to benefit > > > older > > > > > clients. After a second thought, I think the add-on complexity > > > > counteracts > > > > > the gain here as only 2.5 client is getting a slice of the > resilience > > > > > improvement, not for many older versions. > > > > > > > > > > So I decide to drop the UNKNOWN_PRODUCER_ID path, by just claiming > that > > > > > this change would only benefit 2.6 Producer clients. So the only > path > > > > that > > > > > needs version detection is the new transaction coordinator handling > > > > > transactional requests. If the Producer is 2.6+, we pick > > > > > PRODUCER_FENCED(new error code) or TRANSACTION_TIMED_OUT as the > > > response; > > > > > otherwise we return INVALID_PRODUCE_EPOCH to be consistent with > older > > > > > clients. > > > > > > > > > > Does this sound like a better plan? I already updated the KIP with > > > > > simplifications. > > > > > > > > > > > > > > > On Fri, Apr 17, 2020 at 12:02 PM Guozhang Wang <wangg...@gmail.com > > > > > > wrote: > > > > > > > > > > > Hi Boyang, > > > > > > > > > > > > Your reply to 3) seems conflicting with your other answers which > is a > > > > bit > > > > > > confusing to me. Following your other answers, it seems you > suggest > > > > > > returning UNKNOWN_PRODUCER_ID so that 2.5 clients can trigger > retry > > > > logic > > > > > > as well? > > > > > > > > > > > > To complete my reasoning here as a complete picture: > > > > > > > > > > > > a) post KIP-360 (2.5+) the partition leader broker does not > return > > > > > > UNKNOWN_PRODUCER_ID any more. > > > > > > b) upon seeing an old epoch, partition leader cannot tell if it > is > > > due > > > > to > > > > > > fencing or timeout; so it could only return > INVALID_PRODUCER_EPOCH. > > > > > > > > > > > > So the basic idea is to let the clients ask the transaction > > > coordinator > > > > > for > > > > > > the source of truth: > > > > > > > > > > > > 1) 2.5+ client would handle UNKNOWN_PRODUCER_ID (which could > only be > > > > > > returned from old brokers) by trying to re-initialize with the > > > > > transaction > > > > > > coordinator; the coordinator would then tell it whether it is > > > > > > PRODUCER_FENCED or TXN_TIMEOUT. And for old brokers, it would > always > > > > > return > > > > > > PRODUCER_FENCED anyways. > > > > > > 2) 2.6+ client would also handle INVALID_PRODUCER_EPOCH with the > > > retry > > > > > > initializing logic; and similarly the transaction coordinator > would > > > > > > return PRODUCER_FENCED or TXN_TIMEOUT if it is new or always > > > > > > return PRODUCER_FENCED if it is old. > > > > > > > > > > > > The question open is, whether > > > > > > > > > > > > * 3) the new broker should return UNKNOWN_PRODUCER_ID now when > it is > > > > > > *supposed* to return INVALID_PRODUCER_EPOCH and it found the > request > > > is > > > > > > from 2.5 client (note as mentioned in a) right now we do not > > > > > > return UNKNOWN_PRODUCER_ID from brokers anymore). > > > > > > > > > > > > If it does, then 2.5 client could still do the retry logic to the > > > > > > transaction coordinator, i.e. benefit from KIP-360; but the cost > is > > > > > complex > > > > > > logic on the broker side as well as producer API version bump up. > > > > > > If it does not, then when INVALID_PRODUCER_EPOCH is returned to > the > > > old > > > > > > client it would treat it as fatal and not ask the txn > coordinator; > > > but > > > > it > > > > > > simplifies the broker logic and also do not require producer API > > > > version > > > > > > bump. > > > > > > > > > > > > Personally I'd suggest we do the latter, knowing that it would > not > > > > > benefit > > > > > > 2.5 client when the partition leader gets an old epoch and does > not > > > > know > > > > > > whether it is Fenced or Timed Out. > > > > > > > > > > > > > > > > > > Guozhang > > > > > > > > > > > > On Thu, Apr 16, 2020 at 7:59 PM Boyang Chen < > > > > reluctanthero...@gmail.com> > > > > > > wrote: > > > > > > > > > > > > > Thanks Jason and Guozhang for the thoughts. > > > > > > > > > > > > > > On Thu, Apr 16, 2020 at 6:09 PM Guozhang Wang < > wangg...@gmail.com> > > > > > > wrote: > > > > > > > > > > > > > > > For 2/3 above, originally I was not thinking that we will > have a > > > > > > > different > > > > > > > > exception for INVALID_PRODUCER_EPOCH and hence was thinking > that > > > in > > > > > > order > > > > > > > > to leverage KIP-360 for it, we'd have to let the broker to > > > > > > > > return UNKNOWN_PRODUCER_ID. I.e. we'd change the logic of > > > partition > > > > > > > leader > > > > > > > > as well to return UNKNOWN_PRODUCER_ID to let the producer try > > > > > > > > re-initializing the PID on the coordinator, and if it is > indeed > > > due > > > > > to > > > > > > > > fencing, then coordinator can let the client know of the > fatal > > > > error > > > > > > and > > > > > > > > then fail. In that case, then we do need to bump up the > producer > > > > API > > > > > > > > version so that the partition leader knows if it is from > older or > > > > > newer > > > > > > > > clients: if it is older client who do not have KIP-360, we'd > > > return > > > > > > > > INVALID_PRODUCER_EPOCH still; for newer client, we can > > > > > > > > return UNKNOWN_PRODUCER_ID to let it seek what's the truth on > > > > > > > coordinator. > > > > > > > > > > > > > > > I know this is a bit of reversed order. Feel free to check out > my > > > > reply > > > > > > to > > > > > > > Jason first and go back here :) > > > > > > > I think the partition leader will have no change of returned > code > > > > after > > > > > > we > > > > > > > discussed that only new client should be able to retry. > > > > > > > > > > > > > > > > > > > > > > Now since we are additionally mapping INVALID_PRODUCER_EPOCH > to a > > > > > > > different > > > > > > > > error code now and letting clients to handle that similar > > > > > > > > to UNKNOWN_PRODUCER_ID, this can be saved indeed. However, it > > > also > > > > > > means > > > > > > > > that 2.5.0 clients would not get benefited from this KIP, > since > > > > they > > > > > > > still > > > > > > > > treat INVALID_PRODUCER_EPOCH as fatal. If people this that's > okay > > > > > then > > > > > > we > > > > > > > > can simplify the partition leader behavior as well as not > bumping > > > > up > > > > > > > > producer APIs. I think I'm a bit inclined towards simplicity > over > > > > > > > > benefiting older clients. > > > > > > > > > > > > > > > > > > > > > > 2.5 client should get benefits if we return > UNKNOWN_PRODUCER_ID as > > > I > > > > > have > > > > > > > described below. > > > > > > > > > > > > > > > > > > > > > > Guozhang > > > > > > > > > > > > > > > > On Thu, Apr 16, 2020 at 5:37 PM Jason Gustafson < > > > > ja...@confluent.io> > > > > > > > > wrote: > > > > > > > > > > > > > > > > > Hi Boyang, > > > > > > > > > > > > > > > > > > A few minor questions below: > > > > > > > > > > > > > > > > > > 1. You mention UNKNOWN_PRODUCER_ID in 2.a under Resilience > > > > > > > Improvements. > > > > > > > > I > > > > > > > > > assume that should be INVALID_PRODUCER_EPOCH? I am not sure > > > this > > > > > case > > > > > > > > makes > > > > > > > > > sense for 2.5 clients which would view this error as fatal > > > > > regardless > > > > > > > of > > > > > > > > > whatever the broker does. Not sure there's anything we can > do > > > > about > > > > > > > that. > > > > > > > > > Similarly, if a newer client is talking to a 2.5 broker, it > > > > > wouldn't > > > > > > be > > > > > > > > > able to bump the epoch after a timeout because the broker > would > > > > not > > > > > > > know > > > > > > > > to > > > > > > > > > keep the last epoch. Unfortunately, I think the only > > > improvements > > > > > > that > > > > > > > > are > > > > > > > > > possible here are newer clients talking to newer brokers, > but I > > > > > might > > > > > > > > have > > > > > > > > > missed something. > > > > > > > > > > > > > > > > > > > > > > > > The reasoning for returning UNKNOWN_PRODUCER_ID is to trigger > the > > > > retry > > > > > > > logic on 2.5 client hopefully. > > > > > > > As stated in the KIP, the benefited parties are *2.5 client > and 2.6 > > > > > > > client *only. If > > > > > > > transaction coordinator returns INVALID_PRODUCER_EPOCH for a > > > > > transaction > > > > > > > timeout case, 2.5 client still has to treat as fatal as it only > > > > > > recognizes > > > > > > > PRODUCER_FENCED. Returning UNKNOWN_PRODUCER_ID for 2.5 client > is a > > > > > > > substitute for TRANSACTION_TIMED_OUT error code, think of 9656 > > > > > > > <https://issues.apache.org/jira/browse/KAFKA-9656> as a > similar > > > > > > > motivation. > > > > > > > > > > > > > > > > > > > > > > > 2. The proposal says that newer clients talking to older > > > brokers > > > > > > should > > > > > > > > > treat PRODUCER_FENCED as non-fatal. Just to clarify, older > > > > brokers > > > > > > will > > > > > > > > > only return INVALID_PRODUCER_EPOCH because the > PRODUCER_FENCED > > > > > error > > > > > > > did > > > > > > > > > not exist. I think the logic should be something like this. > > > > > > > > > > > > > > > > > > On the Transaction Coordinator: > > > > > > > > > - For old request versions, we continue returning > > > > > > > INVALID_PRODUCER_EPOCH. > > > > > > > > > Clients will translate this to PRODUCER_FENCED as they do > today > > > > and > > > > > > > treat > > > > > > > > > this as a fatal error. > > > > > > > > > - For new request versions, we return either > PRODUCER_FENCED or > > > > > > > > > TRANSACTION_TIMED_OUT depending on the case. Clients will > treat > > > > the > > > > > > > first > > > > > > > > > as fatal and will bump the epoch on the latter. > > > > > > > > > > > > > > > > > > On the partition leader: > > > > > > > > > - Nothing changes. Old clients will treat > > > INVALID_PRODUCER_EPOCH > > > > as > > > > > > > > fatal. > > > > > > > > > New clients will attempt to bump the epoch if the right > version > > > > of > > > > > > > > > InitProducerId is supported and fail otherwise. > > > > > > > > > > > > > > > > > > Does that seem right? > > > > > > > > > > > > > > > > > > > > > > > > As I stated in the previous answer, the txn coordinator could > also > > > > > choose > > > > > > > to return UNKNOWN_PRODUCER_ID for 2.5 clients or older > > > > > > > when there is a txn timeout. The rest looks good to me. > > > > > > > > > > > > > > > > > > > > > > > 3. Why do we need the bump to the Produce API? As far as I > can > > > > > tell, > > > > > > we > > > > > > > > are > > > > > > > > > not changing any errors that are used. The > > > INVALID_PRODUCER_EPOCH > > > > > > error > > > > > > > > is > > > > > > > > > already possible today and the two new errors cannot be > > > returned > > > > > from > > > > > > > > > Produce. > > > > > > > > > > > > > > > > > > > > > > > > Guozhang had some thoughts about it. I think one vague point we > > > > haven't > > > > > > > clarified is what's the expected error code for server and > client. > > > > The > > > > > > > relation should look like: > > > > > > > old client INVALID_PRODUCER_EPOCH -> 47 > (ProducerFencedException) > > > > > > > new client INVALID_PRODUCER_EPOCH -> 47 > (InvalidEpochException), > > > > > > > PRODUCER_FENCED -> 91 (ProducerFencedException) > > > > > > > old broker INVALID_PRODUCER_EPOCH -> 47 > (ProducerFencedException) > > > > > > > new broker INVALID_PRODUCER_EPOCH -> 47 > (InvalidEpochException), > > > > > > > PRODUCER_FENCED -> 91 (ProducerFencedException) > > > > > > > > > > > > > > As we could see here, for a new partition leader, we shall > return > > > > code > > > > > 47 > > > > > > > no matter what as stated in the KIP that PRODUCER_FENCED is > not a > > > > valid > > > > > > > case reported from partition leader. So new client should > always > > > > > trigger > > > > > > > the retry logic for partition leader error. I guess the > Produce API > > > > > bump > > > > > > is > > > > > > > not necessary then. > > > > > > > > > > > > > > > > > > > > > > > 4. The javadoc examples suggest the user should call > > > > > > > `initTransactions()` > > > > > > > > > in the case that a timeout is reached. I don't think that's > > > > right. > > > > > > For > > > > > > > > > KIP-360, we bump the epoch internally. I think we'd want > to do > > > > the > > > > > > same > > > > > > > > > here so that we continue the current pattern where > > > > > > `initTransactions()` > > > > > > > > is > > > > > > > > > used just once. > > > > > > > > > > > > > > > > > > > > > > > > That's correct observation, will address it. > > > > > > > > > > > > > > > > > > > > > > > Thanks, > > > > > > > > > Jason > > > > > > > > > > > > > > > > > > On Thu, Apr 16, 2020 at 2:24 PM Guozhang Wang < > > > > wangg...@gmail.com> > > > > > > > > wrote: > > > > > > > > > > > > > > > > > > > +1 (binding), thanks! > > > > > > > > > > > > > > > > > > > > On Tue, Apr 14, 2020 at 4:36 PM Boyang Chen < > > > > > > > > reluctanthero...@gmail.com> > > > > > > > > > > wrote: > > > > > > > > > > > > > > > > > > > > > Hey all, > > > > > > > > > > > > > > > > > > > > > > I would like to start the vote for KIP-588: > > > > > > > > > > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-588 > > > > > > > > > > > > > > > > > > %3A+Allow+producers+to+recover+gracefully+from+transaction+timeouts > > > > > > > > > > > > > > > > > > > > > > Feel free to continue posting on discussion thread if > you > > > > have > > > > > > > > > > > any questions, thanks! > > > > > > > > > > > > > > > > > > > > > > Best, > > > > > > > > > > > Boyang > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > -- > > > > > > > > > > -- Guozhang > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > -- > > > > > > > > -- Guozhang > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > -- > > > > > > -- Guozhang > > > > > > > > > > > > > > > > > > > > > > > -- > > > > -- Guozhang > > > > > > > > > >