Makes sense, thanks! On Tue, Aug 27, 2019 at 10:38 AM Jason Gustafson <ja...@confluent.io> wrote:
> Hi Guozhang, > > 1. I think there are still some retriable errors that could affect the > transaction APIs. For example, COORDINATOR_LOAD_IN_PROGRESS. > 2. Yes, this is right. The only fatal error is when the producer has been > fenced by another instance. > > Thanks, > Jason > > On Mon, Aug 26, 2019 at 6:05 PM Guozhang Wang <wangg...@gmail.com> wrote: > > > Hi Jason, > > > > I've made another pass on the wiki page and it reads much better now. One > > more clarification about the "Simplified error handling" section: > > > > 1. There will be no "retriable error" from the broker side regarding any > > send requests and txn requests (to txn coordinators). All errors would > > cause the corresponding txn to eventually be aborted. > > 2. Some errors (UNKNOWN_PRODUCER, INVALID_PID_MAPPING) would cause the > > producer entering the ABORTABLE_ERROR state, but only the current txn to > be > > aborted; some others (INVALID_PRODUCER_EPOCH) would cause the producer to > > enter the FATAL_ERROR state, plus it would cause all future txns to be > > aborted. > > > > Is that right? > > > > > > Guozhang > > > > > > On Wed, Aug 21, 2019 at 3:52 PM Matthias J. Sax <matth...@confluent.io> > > wrote: > > > > > > Thanks Jason! > > > > > > LGTM. > > > > > > On 8/21/19 3:07 PM, Jason Gustafson wrote: > > > > Hi Matthias, > > > > > > > > Thanks, I appreciate the thorough review. I've revised the section to > > make > > > > the logic clearer. I think you have it right except for the 1). We > only > > > > generate a new PID if the epoch cannot be incremented without > overflow. > > > > > > > > -Jason > > > > > > > > On Tue, Aug 20, 2019 at 5:45 PM Matthias J. Sax < > matth...@confluent.io > > > > > > > wrote: > > > > > > > >> Thanks for the KIP. I just have some clarification questions to make > > > >> sure I understand the proposal correctly: > > > >> > > > >> 1) "Safe Epoch Incrementing" > > > >> > > > >>> When the coordinator receives a new InitProducerId request, we will > > use > > > >> the following logic to update the epoch: > > > >>> > > > >>> 1. No epoch is provided: the current epoch will be bumped and the > > last > > > >> epoch will be set to -1. > > > >>> 2. Epoch and producerId are provided, and the provided producerId > > > >> matches the current producerId or the provided producerId matches > the > > > >> previous producerId and the provided epoch is exhausted: > > > >>> a. Provided epoch matches current epoch: the last epoch will > be > > > >> set to the current epoch, and the current epoch will be bumped . > > > >>> b. Provided epoch matches last epoch: the current epoch will > be > > > >> returned > > > >>> c. Else: return INVALID_PRODUCER_EPOCH > > > >>> 3. Otherwise, return INVALID_PRODUCER_EPOCH > > > >> > > > >> Case (1) would be for a new producer. Hence, should we state that > "no > > > >> PID" is provided (instead of "no epoch" is provided?). That might be > > > >> clearer and it implies that there is no epoch anyway. > > > >> > > > >> Case (2) would be for a producer that received `UNKNOWN_PRODUCER_ID` > > > >> error and tries to re-initialize itself. > > > >> > > > >> Case (2a) implies that the producer send its first request and is > not > > > >> fenced. Case (2b) implies that the producer re-tries to > re-initialize > > > >> itself, ie, it first request to re-initilize did not get a respond > but > > > >> was processed by the transaction coordinator. Case (2c) implies > that a > > > >> producer was fenced (similar case 3, even if I am not sure what > case 3 > > > >> actually would be?) > > > >> > > > >> Please let me know if my understanding is correct. > > > >> > > > >> What is still unclear to me is, why case (2 -- or is it only 2b?) > > > >> requires that the "provide epoch is exhausted"? > > > >> > > > >> For case 2b: > > > >> > > > >> Assume a producer with PID=100 and epoch=MAX_EPOCH that gets an > > > >> `UNKNOWN_PRODUCER_ID`. It sends initPidRequest with the > corresponding > > > >> PID/epoch pair. The TC processes the request and creates a new > PID=101 > > > >> with new epoch=0, however, the respond to the producer is lost. The > TC > > > >> still stores `currentPid=101`, `currentEpoch=0` and > `previousPid=100`, > > > >> `previoudEpoch=MAX_EPOCH`. When the producer re-tries, the sent > > > >> PID/epoch still matches the previous PID/epoch pair and hence the TC > > > >> know it's a retry? > > > >> > > > >> If this reasoning is correct, should the logic be as follows: > > > >> > > > >> 1. No PID is provided: create a new PID with epoch=0 and set the > last > > > >> epoch to -1. > > > >> 2. Epoch and producerId are provided > > > >> a) the provided producerId/epoch matches the current > > producerId/epoch: > > > >> i) if the epoch is not exhausted, bump the epoch > > > >> ii) if the epoch is exhausted, create a new PID with epoch=0 > > > >> b) the provided producerId/epoch matches the previous > > > >> producerId/epoch: respond with current PID/epoch > > > >> c) Otherwise, return INVALID_PRODUCER_EPOCH > > > >> > > > >> > > > >> > > > >> -Matthias > > > >> > > > >> > > > >> > > > >> > > > >> On 4/4/19 3:47 PM, Jason Gustafson wrote: > > > >>> Hi Everyone, > > > >>> > > > >>> Sorry for the long delay on this KIP. I have updated it to include > > the > > > >>> handling of INVALID_PRODUCER_ID_MAPPING as suggested above. If > there > > are > > > >> no > > > >>> further comments, I will plan to start a vote early next week. > > > >>> > > > >>> Thanks! > > > >>> Jason > > > >>> > > > >>> On Mon, Mar 25, 2019 at 2:33 PM Adam Bellemare < > > adam.bellem...@gmail.com > > > >>> > > > >>> wrote: > > > >>> > > > >>>> Ach - Sorry. I meant Jason. I had just read a John Roesler email. > > > >>>> > > > >>>> On Mon, Mar 25, 2019 at 5:21 PM Adam Bellemare < > > > >> adam.bellem...@gmail.com> > > > >>>> wrote: > > > >>>> > > > >>>>> Hi John > > > >>>>> > > > >>>>> What is the status of this KIP? > > > >>>>> > > > >>>>> My teammates and I are running into the "UNKNOWN_PRODUCER_ID" > error > > on > > > >>>>> 2.1.1 for a multitude of our internal topics, and I suspect that > a > > > >> proper > > > >>>>> fix is needed. > > > >>>>> > > > >>>>> Adam > > > >>>>> > > > >>>>> On Mon, Jan 7, 2019 at 7:42 PM Guozhang Wang <wangg...@gmail.com > > > > > >> wrote: > > > >>>>> > > > >>>>>> Thanks Jason. The proposed solution sounds good to me. > > > >>>>>> > > > >>>>>> > > > >>>>>> Guozhang > > > >>>>>> > > > >>>>>> On Mon, Jan 7, 2019 at 3:52 PM Jason Gustafson < > > ja...@confluent.io> > > > >>>>>> wrote: > > > >>>>>> > > > >>>>>>> Hey Guozhang, > > > >>>>>>> > > > >>>>>>> Thanks for sharing the article. The INVALID_PRODUCER_ID_MAPPING > > error > > > >>>>>>> occurs following expiration of the producerId. It's possible > that > > > >>>>>> another > > > >>>>>>> producerId has been installed in its place following expiration > > (if > > > >>>>>> another > > > >>>>>>> producer instance has become active), or the mapping is empty. > We > > can > > > >>>>>>> safely retry the InitProducerId with the logic in this KIP in > > order > > > >> to > > > >>>>>>> detect which case it is. So I'd suggest something like this: > > > >>>>>>> > > > >>>>>>> 1. After receiving INVALID_PRODUCER_ID_MAPPING, the producer > can > > send > > > >>>>>>> InitProducerId using the current producerId and epoch. > > > >>>>>>> 2. If no mapping exists, the coordinator can generate a new > > > >> producerId > > > >>>>>> and > > > >>>>>>> return it. If a transaction is in progress on the client, it > will > > > >> have > > > >>>>>> to > > > >>>>>>> be aborted, but the producer can continue afterwards. > > > >>>>>>> 3. Otherwise if a different producerId has been assigned, then > we > > can > > > >>>>>>> return INVALID_PRODUCER_ID_MAPPING. To simplify error handling, > > we > > > >> can > > > >>>>>>> probably raise this as ProducerFencedException since that is > > > >>>> effectively > > > >>>>>>> what has happened. Ideally this is the only fatal case that > users > > > >> have > > > >>>>>> to > > > >>>>>>> handle. > > > >>>>>>> > > > >>>>>>> I'll give it a little more thought and update the KIP. > > > >>>>>>> > > > >>>>>>> Thanks, > > > >>>>>>> Jason > > > >>>>>>> > > > >>>>>>> On Thu, Jan 3, 2019 at 1:38 PM Guozhang Wang < > wangg...@gmail.com > > > > > > >>>>>> wrote: > > > >>>>>>> > > > >>>>>>>> You're right about the dangling txn since it will actually > block > > > >>>>>>>> read-committed consumers from proceeding at all. I'd agree > that > > > >>>> since > > > >>>>>>> this > > > >>>>>>>> is a very rare case, we can consider fixing it not via > > broker-side > > > >>>>>> logic > > > >>>>>>>> but via tooling in a future work. > > > >>>>>>>> > > > >>>>>>>> I've also discovered some related error handling logic inside > > > >>>> producer > > > >>>>>>> that > > > >>>>>>>> may be addressed together with this KIP (since it is mostly > for > > > >>>>>> internal > > > >>>>>>>> implementations the wiki itself does not need to be modified): > > > >>>>>>>> > > > >>>>>>>> > > > >>>>>>>> > > > >>>>>>> > > > >>>>>> > > > >>>> > > > >> > > > > > https://stackoverflow.com/questions/53976117/why-did-the-kafka-stream-fail-to-produce-data-after-a-long-time/54029181#54029181 > > > >>>>>>>> > > > >>>>>>>> Guozhang > > > >>>>>>>> > > > >>>>>>>> > > > >>>>>>>> > > > >>>>>>>> On Thu, Nov 29, 2018 at 2:25 PM Jason Gustafson < > > ja...@confluent.io > > > >>>>> > > > >>>>>>>> wrote: > > > >>>>>>>> > > > >>>>>>>>> Hey Guozhang, > > > >>>>>>>>> > > > >>>>>>>>> To clarify, the broker does not actually use the ApiVersion > API > > > >>>> for > > > >>>>>>>>> inter-broker communications. The use of an API and its > > > >>>> corresponding > > > >>>>>>>>> version is controlled by `inter.broker.protocol.version`. > > > >>>>>>>>> > > > >>>>>>>>> Nevertheless, it sounds like we're on the same page about > > removing > > > >>>>>>>>> DescribeTransactionState. The impact of a dangling > transaction > > is > > > >>>> a > > > >>>>>>>> little > > > >>>>>>>>> worse than what you describe though. Consumers with the > > > >>>>>> read_committed > > > >>>>>>>>> isolation level will be stuck. Still, I think we agree that > > this > > > >>>>>> case > > > >>>>>>>>> should be rare and we can reconsider for future work. Rather > > than > > > >>>>>>>>> preventing dangling transactions, perhaps we should consider > > > >>>> options > > > >>>>>>>> which > > > >>>>>>>>> allows us to detect them and recover. Anyway, this needs more > > > >>>>>> thought. > > > >>>>>>> I > > > >>>>>>>>> will update the KIP. > > > >>>>>>>>> > > > >>>>>>>>> Best, > > > >>>>>>>>> Jason > > > >>>>>>>>> > > > >>>>>>>>> On Tue, Nov 27, 2018 at 6:51 PM Guozhang Wang < > > wangg...@gmail.com > > > >>>>> > > > >>>>>>>> wrote: > > > >>>>>>>>> > > > >>>>>>>>>> 0. My original question is about the implementation details > > > >>>>>>> primarily, > > > >>>>>>>>>> since current the handling logic of the APIVersionResponse > is > > > >>>>>> simply > > > >>>>>>>> "use > > > >>>>>>>>>> the highest supported version of the corresponding request", > > but > > > >>>>>> if > > > >>>>>>> the > > > >>>>>>>>>> returned response from APIVersionRequest says "I don't even > > know > > > >>>>>>> about > > > >>>>>>>>> the > > > >>>>>>>>>> DescribeTransactionStateRequest at all", then we need > > additional > > > >>>>>>> logic > > > >>>>>>>>> for > > > >>>>>>>>>> the falling back logic. Currently this logic is embedded in > > > >>>>>>>> NetworkClient > > > >>>>>>>>>> which is shared by all clients, so I'd like to avoid making > > this > > > >>>>>>> logic > > > >>>>>>>>> more > > > >>>>>>>>>> complicated. > > > >>>>>>>>>> > > > >>>>>>>>>> As for the general issue that a broker does not recognize a > > > >>>>>> producer > > > >>>>>>>> with > > > >>>>>>>>>> sequence number 0, here's my thinking: as you mentioned in > the > > > >>>>>> wiki, > > > >>>>>>>> this > > > >>>>>>>>>> is only a concern for transactional producer since for > > > >>>> idempotent > > > >>>>>>>>> producer > > > >>>>>>>>>> it can just bump the epoch and go. For transactional > producer, > > > >>>>>> even > > > >>>>>>> if > > > >>>>>>>>> the > > > >>>>>>>>>> producer request from a fenced producer gets accepted, its > > > >>>>>>> transaction > > > >>>>>>>>> will > > > >>>>>>>>>> never be committed and hence messages not exposed to > > > >>>>>> read-committed > > > >>>>>>>>>> consumers as well. The drawback is though, 1) > read-uncommitted > > > >>>>>>>> consumers > > > >>>>>>>>>> will still read those messages, 2) unnecessary storage for > > those > > > >>>>>>> fenced > > > >>>>>>>>>> produce messages, but in practice should not accumulate to a > > > >>>> large > > > >>>>>>>> amount > > > >>>>>>>>>> since producer should soon try to commit and be told it is > > > >>>> fenced > > > >>>>>> and > > > >>>>>>>>> then > > > >>>>>>>>>> stop, 3) there will be no markers for those transactional > > > >>>> messages > > > >>>>>>>> ever. > > > >>>>>>>>>> Looking at the list and thinking about the likelihood it may > > > >>>>>> happen > > > >>>>>>>>>> assuming we retain the producer up to > transactional.id.timeout > > > >>>>>>> (default > > > >>>>>>>>> is > > > >>>>>>>>>> 7 days), I feel comfortable leaving it as is. > > > >>>>>>>>>> > > > >>>>>>>>>> Guozhang > > > >>>>>>>>>> > > > >>>>>>>>>> On Mon, Nov 26, 2018 at 6:09 PM Jason Gustafson < > > > >>>>>> ja...@confluent.io> > > > >>>>>>>>>> wrote: > > > >>>>>>>>>> > > > >>>>>>>>>>> Hey Guozhang, > > > >>>>>>>>>>> > > > >>>>>>>>>>> Thanks for the comments. Responses below: > > > >>>>>>>>>>> > > > >>>>>>>>>>> 0. The new API is used between brokers, so we govern its > > usage > > > >>>>>>> using > > > >>>>>>>>>>> `inter.broker.protocol.version`. If the other broker hasn't > > > >>>>>>> upgraded, > > > >>>>>>>>> we > > > >>>>>>>>>>> will just fallback to the old logic, which is to accept the > > > >>>>>> write. > > > >>>>>>>> This > > > >>>>>>>>>> is > > > >>>>>>>>>>> similar to how we introduced the OffsetsForLeaderEpoch API. > > > >>>> Does > > > >>>>>>> that > > > >>>>>>>>>> seem > > > >>>>>>>>>>> reasonable? > > > >>>>>>>>>>> > > > >>>>>>>>>>> To tell the truth, after digging this KIP up and reading it > > > >>>>>> over, I > > > >>>>>>>> am > > > >>>>>>>>>>> doubting how crucial this API is. It is attempting to > protect > > > >>>> a > > > >>>>>>> write > > > >>>>>>>>>> from > > > >>>>>>>>>>> a zombie which has just reset its sequence number after > that > > > >>>>>>> producer > > > >>>>>>>>> had > > > >>>>>>>>>>> had its state cleaned up. However, one of the other > > > >>>>>> improvements in > > > >>>>>>>>> this > > > >>>>>>>>>>> KIP is to maintain producer state beyond its retention in > the > > > >>>>>> log. > > > >>>>>>> I > > > >>>>>>>>>> think > > > >>>>>>>>>>> that makes this case sufficiently unlikely that we can > leave > > > >>>> it > > > >>>>>> for > > > >>>>>>>>>> future > > > >>>>>>>>>>> work. I am not 100% sure this is the only scenario where > > > >>>>>>> transaction > > > >>>>>>>>>> state > > > >>>>>>>>>>> and log state can diverge anyway, so it would be better to > > > >>>>>> consider > > > >>>>>>>>> this > > > >>>>>>>>>>> problem more generally. What do you think? > > > >>>>>>>>>>> > > > >>>>>>>>>>> 1. Thanks, from memory, the term changed after the first > > > >>>>>> iteration. > > > >>>>>>>>> I'll > > > >>>>>>>>>>> make a pass and try to clarify usage. > > > >>>>>>>>>>> 2. I was attempting to handle some edge cases since this > > check > > > >>>>>>> would > > > >>>>>>>> be > > > >>>>>>>>>>> asynchronous. In any case, if we drop this validation as > > > >>>>>> suggested > > > >>>>>>>>> above, > > > >>>>>>>>>>> then we can ignore this. > > > >>>>>>>>>>> > > > >>>>>>>>>>> -Jason > > > >>>>>>>>>>> > > > >>>>>>>>>>> > > > >>>>>>>>>>> > > > >>>>>>>>>>> On Tue, Nov 13, 2018 at 6:23 PM Guozhang Wang < > > > >>>>>> wangg...@gmail.com> > > > >>>>>>>>>> wrote: > > > >>>>>>>>>>> > > > >>>>>>>>>>>> Hello Jason, thanks for the great write-up. > > > >>>>>>>>>>>> > > > >>>>>>>>>>>> 0. One question about the migration plan: "The new > > > >>>>>>>>> GetTransactionState > > > >>>>>>>>>>> API > > > >>>>>>>>>>>> and the new version of the transaction state message will > > > >>>> not > > > >>>>>> be > > > >>>>>>>> used > > > >>>>>>>>>>> until > > > >>>>>>>>>>>> the inter-broker version supports it." I'm not so clear > > > >>>> about > > > >>>>>> the > > > >>>>>>>>>>>> implementation details here: say a broker is on the newer > > > >>>>>> version > > > >>>>>>>> and > > > >>>>>>>>>> the > > > >>>>>>>>>>>> txn-coordinator is still on older version. Today the > > > >>>>>>>>> APIVersionsRequest > > > >>>>>>>>>>> can > > > >>>>>>>>>>>> only help upgrade / downgrade the request version, but not > > > >>>>>>>> forbidding > > > >>>>>>>>>>>> sending any. Are you suggesting we add additional logic on > > > >>>> the > > > >>>>>>>> broker > > > >>>>>>>>>>> side > > > >>>>>>>>>>>> to handle the case of "not sending the request"? If yes my > > > >>>>>>> concern > > > >>>>>>>> is > > > >>>>>>>>>>> that > > > >>>>>>>>>>>> this will be some tech-debt code that will live long > before > > > >>>>>> being > > > >>>>>>>>>>> removed. > > > >>>>>>>>>>>> > > > >>>>>>>>>>>> Some additional minor comments: > > > >>>>>>>>>>>> > > > >>>>>>>>>>>> 1. "last epoch" and "instance epoch" seem to be referring > to > > > >>>>>> the > > > >>>>>>>> same > > > >>>>>>>>>>> thing > > > >>>>>>>>>>>> in your wiki. > > > >>>>>>>>>>>> 2. "The broker must verify after receiving the response > that > > > >>>>>> the > > > >>>>>>>>>> producer > > > >>>>>>>>>>>> state is still unknown.": not sure why we have to > validate? > > > >>>> If > > > >>>>>>> the > > > >>>>>>>>>>> metadata > > > >>>>>>>>>>>> returned from the txn-coordinator can always be considered > > > >>>> the > > > >>>>>>>>>>>> source-of-truth, can't we just bindly use it to update the > > > >>>>>> cache? > > > >>>>>>>>>>>> > > > >>>>>>>>>>>> > > > >>>>>>>>>>>> Guozhang > > > >>>>>>>>>>>> > > > >>>>>>>>>>>> > > > >>>>>>>>>>>> > > > >>>>>>>>>>>> On Thu, Sep 6, 2018 at 9:10 PM Matthias J. Sax < > > > >>>>>>>>> matth...@confluent.io> > > > >>>>>>>>>>>> wrote: > > > >>>>>>>>>>>> > > > >>>>>>>>>>>>> I am +1 on this :) > > > >>>>>>>>>>>>> > > > >>>>>>>>>>>>> > > > >>>>>>>>>>>>> -Matthias > > > >>>>>>>>>>>>> > > > >>>>>>>>>>>>> On 9/4/18 9:55 AM, Jason Gustafson wrote: > > > >>>>>>>>>>>>>> Bump. Thanks to Magnus for noticing that I forgot to > > > >>>> link > > > >>>>>> to > > > >>>>>>>> the > > > >>>>>>>>>> KIP: > > > >>>>>>>>>>>>>> > > > >>>>>>>>>>>>> > > > >>>>>>>>>>>> > > > >>>>>>>>>>> > > > >>>>>>>>>> > > > >>>>>>>>> > > > >>>>>>>> > > > >>>>>>> > > > >>>>>> > > > >>>> > > > >> > > > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-360%3A+Improve+handling+of+unknown+producer > > > >>>>>>>>>>>>>> . > > > >>>>>>>>>>>>>> > > > >>>>>>>>>>>>>> -Jason > > > >>>>>>>>>>>>>> > > > >>>>>>>>>>>>>> On Tue, Aug 21, 2018 at 4:37 PM, Jason Gustafson < > > > >>>>>>>>>> ja...@confluent.io > > > >>>>>>>>>>>> > > > >>>>>>>>>>>>> wrote: > > > >>>>>>>>>>>>>> > > > >>>>>>>>>>>>>>> Hi All, > > > >>>>>>>>>>>>>>> > > > >>>>>>>>>>>>>>> I have a proposal to improve the > > > >>>> transactional/idempotent > > > >>>>>>>>>> producer's > > > >>>>>>>>>>>>>>> handling of the UNKNOWN_PRODUCER error, which is the > > > >>>>>> result > > > >>>>>>> of > > > >>>>>>>>>>> losing > > > >>>>>>>>>>>>>>> producer state following segment removal. The current > > > >>>>>>> behavior > > > >>>>>>>>> is > > > >>>>>>>>>>> both > > > >>>>>>>>>>>>>>> complex and limited. Please take a look and let me know > > > >>>>>> what > > > >>>>>>>> you > > > >>>>>>>>>>>> think. > > > >>>>>>>>>>>>>>> > > > >>>>>>>>>>>>>>> Thanks in advance to Matthias Sax for feedback on the > > > >>>>>>> initial > > > >>>>>>>>>> draft. > > > >>>>>>>>>>>>>>> > > > >>>>>>>>>>>>>>> -Jason > > > >>>>>>>>>>>>>>> > > > >>>>>>>>>>>>>> > > > >>>>>>>>>>>>> > > > >>>>>>>>>>>>> > > > >>>>>>>>>>>> > > > >>>>>>>>>>>> -- > > > >>>>>>>>>>>> -- Guozhang > > > >>>>>>>>>>>> > > > >>>>>>>>>>> > > > >>>>>>>>>> > > > >>>>>>>>>> > > > >>>>>>>>>> -- > > > >>>>>>>>>> -- Guozhang > > > >>>>>>>>>> > > > >>>>>>>>> > > > >>>>>>>> > > > >>>>>>>> > > > >>>>>>>> -- > > > >>>>>>>> -- Guozhang > > > >>>>>>>> > > > >>>>>>> > > > >>>>>> > > > >>>>>> > > > >>>>>> -- > > > >>>>>> -- Guozhang > > > >>>>>> > > > >>>>> > > > >>>> > > > >>> > > > >> > > > >> > > > > > > > > > > > > > -- > > -- Guozhang > > > -- -- Guozhang