Re: [Vote] KIP-588: Allow producers to recover gracefully from transaction timeouts

Boyang Chen Mon, 27 Apr 2020 09:51:39 -0700

Hey all,

we want to piggy-back one more protocol change to this KIP, which is the
InitProducerId. Previously we overlooked the case returned from InitPid
where the error code is still INVALID_PRODUCER_EPOCH instead of
PRODUCER_FENCED. To be consistent, we will bump this protocol and always
return PRODUCER_FENCED for new request versions.


Let me know if you have any questions, the KIP is already updated.

Boyang

On Fri, Apr 24, 2020 at 8:44 PM Boyang Chen <reluctanthero...@gmail.com>
wrote:

> Thanks a lot John for giving a review on Friday night! And thank you
> Guozhang, and Jason for your votes as well :)
>
> Now that we have collected 3 binding votes (Guozhang, Jason, John), I will
> close the voting thread and mark the KIP as approved. Still feel free to
> raise any question on the mailing list for sure!
>
> Best,
> Boyang
>
> On Fri, Apr 24, 2020 at 8:29 PM John Roesler <vvcep...@apache.org> wrote:
>
>> Hi Boyang,
>>
>> Thanks for the KIP! I've just read it over and caught up on all the prior
>> discussions.
>> The current version of the KIP looks good to me, and I think the
>> decisions you've
>> made are reasonable.
>>
>> I'm +1 (binding)
>>
>> Thanks,
>> -John
>>
>> On Wed, Apr 22, 2020, at 12:12, Boyang Chen wrote:
>> > Hey Jason,
>> >
>> > thanks for the suggestions! Addressed in the KIP.
>> >
>> > On Wed, Apr 22, 2020 at 9:21 AM Jason Gustafson <ja...@confluent.io>
>> wrote:
>> >
>> > > +1 Just a couple small comments:
>> > >
>> > > 1. My comment about `initTransactions()` usage in the javadoc above
>> appears
>> > > not to have been addressed.
>> > > 2. For the handling of INVALID_PRODUCER_EPOCH in the produce response,
>> > > would we only try to abort if the broker supports the newer protocol
>> > > version? I guess it would be simpler in the implementation if the
>> producer
>> > > did it in any case even if it might not be useful for older versions.
>> > >
>> > > -Jason
>> > >
>> > > On Fri, Apr 17, 2020 at 3:55 PM Guozhang Wang <wangg...@gmail.com>
>> wrote:
>> > >
>> > > > Sounds good to me. Thanks Boyang.
>> > > >
>> > > > On Fri, Apr 17, 2020 at 3:32 PM Boyang Chen <
>> reluctanthero...@gmail.com>
>> > > > wrote:
>> > > >
>> > > > > Thanks Guozhang,
>> > > > >
>> > > > > I think most of the complexity comes from our intention to benefit
>> > > older
>> > > > > clients. After a second thought, I think the add-on complexity
>> > > > counteracts
>> > > > > the gain here as only 2.5 client is getting a slice of the
>> resilience
>> > > > > improvement, not for many older versions.
>> > > > >
>> > > > > So I decide to drop the UNKNOWN_PRODUCER_ID path, by just
>> claiming that
>> > > > > this change would only benefit 2.6 Producer clients. So the only
>> path
>> > > > that
>> > > > > needs version detection is the new transaction coordinator
>> handling
>> > > > > transactional requests. If the Producer is 2.6+, we pick
>> > > > > PRODUCER_FENCED(new error code) or TRANSACTION_TIMED_OUT as the
>> > > response;
>> > > > > otherwise  we return INVALID_PRODUCE_EPOCH to be consistent with
>> older
>> > > > > clients.
>> > > > >
>> > > > > Does this sound like a better plan? I already updated the KIP with
>> > > > > simplifications.
>> > > > >
>> > > > >
>> > > > > On Fri, Apr 17, 2020 at 12:02 PM Guozhang Wang <
>> wangg...@gmail.com>
>> > > > wrote:
>> > > > >
>> > > > > > Hi Boyang,
>> > > > > >
>> > > > > > Your reply to 3) seems conflicting with your other answers
>> which is a
>> > > > bit
>> > > > > > confusing to me. Following your other answers, it seems you
>> suggest
>> > > > > > returning UNKNOWN_PRODUCER_ID so that 2.5 clients can trigger
>> retry
>> > > > logic
>> > > > > > as well?
>> > > > > >
>> > > > > > To complete my reasoning here as a complete picture:
>> > > > > >
>> > > > > > a) post KIP-360 (2.5+) the partition leader broker does not
>> return
>> > > > > > UNKNOWN_PRODUCER_ID any more.
>> > > > > > b) upon seeing an old epoch, partition leader cannot tell if it
>> is
>> > > due
>> > > > to
>> > > > > > fencing or timeout; so it could only return
>> INVALID_PRODUCER_EPOCH.
>> > > > > >
>> > > > > > So the basic idea is to let the clients ask the transaction
>> > > coordinator
>> > > > > for
>> > > > > > the source of truth:
>> > > > > >
>> > > > > > 1) 2.5+ client would handle UNKNOWN_PRODUCER_ID (which could
>> only be
>> > > > > > returned from old brokers) by trying to re-initialize with the
>> > > > > transaction
>> > > > > > coordinator; the coordinator would then tell it whether it is
>> > > > > > PRODUCER_FENCED or TXN_TIMEOUT. And for old brokers, it would
>> always
>> > > > > return
>> > > > > > PRODUCER_FENCED anyways.
>> > > > > > 2) 2.6+ client would also handle INVALID_PRODUCER_EPOCH with the
>> > > retry
>> > > > > > initializing logic; and similarly the transaction coordinator
>> would
>> > > > > > return PRODUCER_FENCED or TXN_TIMEOUT if it is new or always
>> > > > > > return PRODUCER_FENCED if it is old.
>> > > > > >
>> > > > > > The question open is, whether
>> > > > > >
>> > > > > > * 3) the new broker should return UNKNOWN_PRODUCER_ID now when
>> it is
>> > > > > > *supposed* to return INVALID_PRODUCER_EPOCH and it found the
>> request
>> > > is
>> > > > > > from 2.5 client (note as mentioned in a) right now we do not
>> > > > > > return UNKNOWN_PRODUCER_ID from brokers anymore).
>> > > > > >
>> > > > > > If it does, then 2.5 client could still do the retry logic to
>> the
>> > > > > > transaction coordinator, i.e. benefit from KIP-360; but the
>> cost is
>> > > > > complex
>> > > > > > logic on the broker side as well as producer API version bump
>> up.
>> > > > > > If it does not, then when INVALID_PRODUCER_EPOCH is returned to
>> the
>> > > old
>> > > > > > client it would treat it as fatal and not ask the txn
>> coordinator;
>> > > but
>> > > > it
>> > > > > > simplifies the broker logic and also do not require producer API
>> > > > version
>> > > > > > bump.
>> > > > > >
>> > > > > > Personally I'd suggest we do the latter, knowing that it would
>> not
>> > > > > benefit
>> > > > > > 2.5 client when the partition leader gets an old epoch and does
>> not
>> > > > know
>> > > > > > whether it is Fenced or Timed Out.
>> > > > > >
>> > > > > >
>> > > > > > Guozhang
>> > > > > >
>> > > > > > On Thu, Apr 16, 2020 at 7:59 PM Boyang Chen <
>> > > > reluctanthero...@gmail.com>
>> > > > > > wrote:
>> > > > > >
>> > > > > > > Thanks Jason and Guozhang for the thoughts.
>> > > > > > >
>> > > > > > > On Thu, Apr 16, 2020 at 6:09 PM Guozhang Wang <
>> wangg...@gmail.com>
>> > > > > > wrote:
>> > > > > > >
>> > > > > > > > For 2/3 above, originally I was not thinking that we will
>> have a
>> > > > > > > different
>> > > > > > > > exception for INVALID_PRODUCER_EPOCH and hence was thinking
>> that
>> > > in
>> > > > > > order
>> > > > > > > > to leverage KIP-360 for it, we'd have to let the broker to
>> > > > > > > > return UNKNOWN_PRODUCER_ID. I.e. we'd change the logic of
>> > > partition
>> > > > > > > leader
>> > > > > > > > as well to return UNKNOWN_PRODUCER_ID to let the producer
>> try
>> > > > > > > > re-initializing the PID on the coordinator, and if it is
>> indeed
>> > > due
>> > > > > to
>> > > > > > > > fencing, then coordinator can let the client know of the
>> fatal
>> > > > error
>> > > > > > and
>> > > > > > > > then fail. In that case, then we do need to bump up the
>> producer
>> > > > API
>> > > > > > > > version so that the partition leader knows if it is from
>> older or
>> > > > > newer
>> > > > > > > > clients: if it is older client who do not have KIP-360, we'd
>> > > return
>> > > > > > > > INVALID_PRODUCER_EPOCH still; for newer client, we can
>> > > > > > > > return UNKNOWN_PRODUCER_ID to let it seek what's the truth
>> on
>> > > > > > > coordinator.
>> > > > > > > >
>> > > > > > > I know this is a bit of reversed order. Feel free to check
>> out my
>> > > > reply
>> > > > > > to
>> > > > > > > Jason first and go back here :)
>> > > > > > > I think the partition leader will have no change of returned
>> code
>> > > > after
>> > > > > > we
>> > > > > > > discussed that only new client should be able to retry.
>> > > > > > >
>> > > > > > >
>> > > > > > > > Now since we are additionally mapping
>> INVALID_PRODUCER_EPOCH to a
>> > > > > > > different
>> > > > > > > > error code now and letting clients to handle that similar
>> > > > > > > > to UNKNOWN_PRODUCER_ID, this can be saved indeed. However,
>> it
>> > > also
>> > > > > > means
>> > > > > > > > that 2.5.0 clients would not get benefited from this KIP,
>> since
>> > > > they
>> > > > > > > still
>> > > > > > > > treat INVALID_PRODUCER_EPOCH as fatal. If people this
>> that's okay
>> > > > > then
>> > > > > > we
>> > > > > > > > can simplify the partition leader behavior as well as not
>> bumping
>> > > > up
>> > > > > > > > producer APIs. I think I'm a bit inclined towards
>> simplicity over
>> > > > > > > > benefiting older clients.
>> > > > > > > >
>> > > > > > >
>> > > > > > > 2.5 client should get benefits if we return
>> UNKNOWN_PRODUCER_ID as
>> > > I
>> > > > > have
>> > > > > > > described below.
>> > > > > > >
>> > > > > > >
>> > > > > > > > Guozhang
>> > > > > > > >
>> > > > > > > > On Thu, Apr 16, 2020 at 5:37 PM Jason Gustafson <
>> > > > ja...@confluent.io>
>> > > > > > > > wrote:
>> > > > > > > >
>> > > > > > > > > Hi Boyang,
>> > > > > > > > >
>> > > > > > > > > A few minor questions below:
>> > > > > > > > >
>> > > > > > > > > 1. You mention UNKNOWN_PRODUCER_ID in 2.a under Resilience
>> > > > > > > Improvements.
>> > > > > > > > I
>> > > > > > > > > assume that should be INVALID_PRODUCER_EPOCH? I am not
>> sure
>> > > this
>> > > > > case
>> > > > > > > > makes
>> > > > > > > > > sense for 2.5 clients which would view this error as fatal
>> > > > > regardless
>> > > > > > > of
>> > > > > > > > > whatever the broker does. Not sure there's anything we
>> can do
>> > > > about
>> > > > > > > that.
>> > > > > > > > > Similarly, if a newer client is talking to a 2.5 broker,
>> it
>> > > > > wouldn't
>> > > > > > be
>> > > > > > > > > able to bump the epoch after a timeout because the broker
>> would
>> > > > not
>> > > > > > > know
>> > > > > > > > to
>> > > > > > > > > keep the last epoch. Unfortunately, I think the only
>> > > improvements
>> > > > > > that
>> > > > > > > > are
>> > > > > > > > > possible here are newer clients talking to newer brokers,
>> but I
>> > > > > might
>> > > > > > > > have
>> > > > > > > > > missed something.
>> > > > > > > > >
>> > > > > > > >
>> > > > > > > The reasoning for returning UNKNOWN_PRODUCER_ID is to trigger
>> the
>> > > > retry
>> > > > > > > logic on 2.5 client hopefully.
>> > > > > > > As stated in the KIP, the benefited parties are *2.5 client
>> and 2.6
>> > > > > > > client *only. If
>> > > > > > > transaction coordinator returns INVALID_PRODUCER_EPOCH for a
>> > > > > transaction
>> > > > > > > timeout case, 2.5 client still has to treat as fatal as it
>> only
>> > > > > > recognizes
>> > > > > > > PRODUCER_FENCED. Returning UNKNOWN_PRODUCER_ID for 2.5 client
>> is a
>> > > > > > > substitute for TRANSACTION_TIMED_OUT error code, think of 9656
>> > > > > > > <https://issues.apache.org/jira/browse/KAFKA-9656> as a
>> similar
>> > > > > > > motivation.
>> > > > > > >
>> > > > > > >
>> > > > > > > > > 2. The proposal says that newer clients talking to older
>> > > brokers
>> > > > > > should
>> > > > > > > > > treat PRODUCER_FENCED as non-fatal. Just to clarify, older
>> > > > brokers
>> > > > > > will
>> > > > > > > > > only return INVALID_PRODUCER_EPOCH because the
>> PRODUCER_FENCED
>> > > > > error
>> > > > > > > did
>> > > > > > > > > not exist. I think the logic should be something like
>> this.
>> > > > > > > > >
>> > > > > > > > > On the Transaction Coordinator:
>> > > > > > > > > - For old request versions, we continue returning
>> > > > > > > INVALID_PRODUCER_EPOCH.
>> > > > > > > > > Clients will translate this to PRODUCER_FENCED as they do
>> today
>> > > > and
>> > > > > > > treat
>> > > > > > > > > this as a fatal error.
>> > > > > > > > > - For new request versions, we return either
>> PRODUCER_FENCED or
>> > > > > > > > > TRANSACTION_TIMED_OUT depending on the case. Clients will
>> treat
>> > > > the
>> > > > > > > first
>> > > > > > > > > as fatal and will bump the epoch on the latter.
>> > > > > > > > >
>> > > > > > > > > On the partition leader:
>> > > > > > > > > - Nothing changes. Old clients will treat
>> > > INVALID_PRODUCER_EPOCH
>> > > > as
>> > > > > > > > fatal.
>> > > > > > > > > New clients will attempt to bump the epoch if the right
>> version
>> > > > of
>> > > > > > > > > InitProducerId is supported and fail otherwise.
>> > > > > > > > >
>> > > > > > > > > Does that seem right?
>> > > > > > > > >
>> > > > > > > >
>> > > > > > > As I stated in the previous answer, the txn coordinator could
>> also
>> > > > > choose
>> > > > > > > to return UNKNOWN_PRODUCER_ID for 2.5 clients or older
>> > > > > > > when there is a txn timeout. The rest looks good to me.
>> > > > > > >
>> > > > > > >
>> > > > > > > > > 3. Why do we need the bump to the Produce API? As far as
>> I can
>> > > > > tell,
>> > > > > > we
>> > > > > > > > are
>> > > > > > > > > not changing any errors that are used. The
>> > > INVALID_PRODUCER_EPOCH
>> > > > > > error
>> > > > > > > > is
>> > > > > > > > > already possible today and the two new errors cannot be
>> > > returned
>> > > > > from
>> > > > > > > > > Produce.
>> > > > > > > > >
>> > > > > > > >
>> > > > > > > Guozhang had some thoughts about it. I think one vague point
>> we
>> > > > haven't
>> > > > > > > clarified is what's the expected error code for server and
>> client.
>> > > > The
>> > > > > > > relation should look like:
>> > > > > > > old client INVALID_PRODUCER_EPOCH -> 47
>> (ProducerFencedException)
>> > > > > > > new client INVALID_PRODUCER_EPOCH -> 47
>> (InvalidEpochException),
>> > > > > > > PRODUCER_FENCED -> 91 (ProducerFencedException)
>> > > > > > > old broker INVALID_PRODUCER_EPOCH -> 47
>> (ProducerFencedException)
>> > > > > > > new broker INVALID_PRODUCER_EPOCH -> 47
>> (InvalidEpochException),
>> > > > > > > PRODUCER_FENCED -> 91 (ProducerFencedException)
>> > > > > > >
>> > > > > > > As we could see here, for a new partition leader, we shall
>> return
>> > > > code
>> > > > > 47
>> > > > > > > no matter what as stated in the KIP that PRODUCER_FENCED is
>> not a
>> > > > valid
>> > > > > > > case reported from partition leader. So new client should
>> always
>> > > > > trigger
>> > > > > > > the retry logic for partition leader error. I guess the
>> Produce API
>> > > > > bump
>> > > > > > is
>> > > > > > > not necessary then.
>> > > > > > >
>> > > > > > >
>> > > > > > > > > 4. The javadoc examples suggest the user should call
>> > > > > > > `initTransactions()`
>> > > > > > > > > in the case that a timeout is reached. I don't think
>> that's
>> > > > right.
>> > > > > > For
>> > > > > > > > > KIP-360, we bump the epoch internally. I think we'd want
>> to do
>> > > > the
>> > > > > > same
>> > > > > > > > > here so that we continue the current pattern where
>> > > > > > `initTransactions()`
>> > > > > > > > is
>> > > > > > > > > used just once.
>> > > > > > > > >
>> > > > > > > >
>> > > > > > > That's correct observation, will address it.
>> > > > > > >
>> > > > > > >
>> > > > > > > > > Thanks,
>> > > > > > > > > Jason
>> > > > > > > > >
>> > > > > > > > > On Thu, Apr 16, 2020 at 2:24 PM Guozhang Wang <
>> > > > wangg...@gmail.com>
>> > > > > > > > wrote:
>> > > > > > > > >
>> > > > > > > > > > +1 (binding), thanks!
>> > > > > > > > > >
>> > > > > > > > > > On Tue, Apr 14, 2020 at 4:36 PM Boyang Chen <
>> > > > > > > > reluctanthero...@gmail.com>
>> > > > > > > > > > wrote:
>> > > > > > > > > >
>> > > > > > > > > > > Hey all,
>> > > > > > > > > > >
>> > > > > > > > > > > I would like to start the vote for KIP-588:
>> > > > > > > > > > >
>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-588
>> > > > > > > > > > >
>> > > > > >
>> %3A+Allow+producers+to+recover+gracefully+from+transaction+timeouts
>> > > > > > > > > > >
>> > > > > > > > > > > Feel free to continue posting on discussion thread if
>> you
>> > > > have
>> > > > > > > > > > > any questions, thanks!
>> > > > > > > > > > >
>> > > > > > > > > > > Best,
>> > > > > > > > > > > Boyang
>> > > > > > > > > > >
>> > > > > > > > > >
>> > > > > > > > > >
>> > > > > > > > > > --
>> > > > > > > > > > -- Guozhang
>> > > > > > > > > >
>> > > > > > > > >
>> > > > > > > >
>> > > > > > > >
>> > > > > > > > --
>> > > > > > > > -- Guozhang
>> > > > > > > >
>> > > > > > >
>> > > > > >
>> > > > > >
>> > > > > > --
>> > > > > > -- Guozhang
>> > > > > >
>> > > > >
>> > > >
>> > > >
>> > > > --
>> > > > -- Guozhang
>> > > >
>> > >
>> >
>>
>

Re: [Vote] KIP-588: Allow producers to recover gracefully from transaction timeouts

Reply via email to