That's a good suggestion Jason. Adding a dedicated PRODUCER_FENCED error should help distinguish exceptions and could safely mark INVALID_PRODUCER_EPOCH exception as non-fatal in the new code. Updated the KIP.
Boyang On Wed, Apr 8, 2020 at 12:18 PM Jason Gustafson <ja...@confluent.io> wrote: > Hey Boyang, > > Thanks for the KIP. I think the main problem we've identified here is that > the current errors conflate transaction timeouts with producer fencing. The > first of these ought to be recoverable, but we cannot distinguish it. The > suggestion to add a new error code makes sense to me, but it leaves this > bit of awkwardness: > > > One extra issue that needs to be addressed is how to handle > `ProducerFenced` from Produce requests. > > In fact, the underlying error code here is INVALID_PRODUCER_EPOCH. It's > just that the code treats this as equivalent to `ProducerFenced`. One > thought I had is maybe PRODUCER_FENCED needs to be a separate error code as > well. After all, only the transaction coordinator knows whether a producer > has been fenced or not. So maybe the handling could be something like the > following: > > 1. Produce requests may return INVALID_PRODUCER_EPOCH. The producer > recovers by following KIP-360 logic to see whether the epoch can be bumped. > If it cannot because the broker version is too old, we fail. > 2. Transactional APIs may return either TRANSACTION_TIMEOUT or > PRODUCER_FENCED. In the first case, we do the same as above. We try to > recover by bumping the epoch. If the error is PRODUCER_FENCED, it is fatal. > 3. Older brokers may return INVALID_PRODUCER_EPOCH as well from > transactional APIs. We treat this the same as 1. > > What do you think? > > Thanks, > Jason > > > > > > > > > > > On Mon, Apr 6, 2020 at 3:41 PM Boyang Chen <reluctanthero...@gmail.com> > wrote: > > > Yep, updated the KIP, thanks! > > > > On Mon, Apr 6, 2020 at 3:11 PM Guozhang Wang <wangg...@gmail.com> wrote: > > > > > Regarding 2), sounds good, I saw UNKNOWN_PRODUCER_ID is properly > handled > > > today in produce / add-partitions-to-txn / add-offsets-to-txn / end-txn > > > responses, so that should be well covered. > > > > > > Could you reflect this in the wiki page that the broker should be > > > responsible for using different error codes given client request > versions > > > as well? > > > > > > > > > > > > Guozhang > > > > > > On Mon, Apr 6, 2020 at 9:20 AM Boyang Chen <reluctanthero...@gmail.com > > > > > wrote: > > > > > > > Thanks Guozhang for the review! > > > > > > > > On Sun, Apr 5, 2020 at 5:47 PM Guozhang Wang <wangg...@gmail.com> > > wrote: > > > > > > > > > Hello Boyang, > > > > > > > > > > Thank you for the proposed KIP. Just some minor comments below: > > > > > > > > > > 1. Could you also describe which producer APIs could potentially > > throw > > > > the > > > > > new TransactionTimedOutException, and also how should callers > handle > > > them > > > > > differently (i.e. just to make your description more concrete as > > > > javadocs). > > > > > > > > > > Good point, I will add example java doc changes. > > > > > > > > > > > > > 2. It's straight-forward if client is on newer version while > broker's > > > on > > > > > older version; however If the client is on older version while > > broker's > > > > on > > > > > newer version, today would the internal module of producers treat > it > > > as a > > > > > general fatal error or not? If not, should the broker set a > different > > > > error > > > > > code upon detecting older request versions? > > > > > > > > > > That's a good suggestion, my understanding is that the prerequisite > > of > > > > this change is the new KIP-360 API which is going out with 2.5, > > > > so we could just return UNKNOWN_PRODUCER_ID instead of > PRODUCER_FENCED > > as > > > > it could be interpreted as abortable error > > > > in 2.5 producer and retry. Producers older than 2.5 will not be > > covered. > > > > WDYT? > > > > > > > > > > > > > > Guozhang > > > > > > > > > > On Thu, Apr 2, 2020 at 1:40 PM Boyang Chen < > > reluctanthero...@gmail.com > > > > > > > > > wrote: > > > > > > > > > > > Hey there, > > > > > > > > > > > > I would like to start discussion for KIP-588: > > > > > > > > > > > > > > > > > > > > > > > > > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-588%3A+Allow+producers+to+recover+gracefully+from+transaction+timeouts > > > > > > > > > > > > which aims to improve Producer resilience to transaction timeout > > due > > > to > > > > > > transient system gaps. > > > > > > > > > > > > Best, > > > > > > Boyang > > > > > > > > > > > > > > > > > > > > > -- > > > > > -- Guozhang > > > > > > > > > > > > > > > > > > -- > > > -- Guozhang > > > > > >