Hi Guozhang,

1. I think there are still some retriable errors that could affect the transaction APIs. For example, COORDINATOR_LOAD_IN_PROGRESS.

2. Yes, this is right. The only fatal error is when the producer has been fenced by another instance.
Thanks,
Jason

On Mon, Aug 26, 2019 at 6:05 PM Guozhang Wang <wangg...@gmail.com> wrote: > Hi Jason, > > I've made another pass on the wiki page and it reads much better now. One > more clarification about the "Simplified error handling" section: > > 1. There will be no "retriable error" from the broker side regarding any > send requests and txn requests (to txn coordinators). All errors would > cause the corresponding txn to eventually be aborted. > 2. Some errors (UNKNOWN_PRODUCER, INVALID_PID_MAPPING) would cause the > producer to enter the ABORTABLE_ERROR state, but only the current txn to be > aborted; some others (INVALID_PRODUCER_EPOCH) would cause the producer to > enter the FATAL_ERROR state, plus it would cause all future txns to be > aborted. > > Is that right? > > > Guozhang > > > On Wed, Aug 21, 2019 at 3:52 PM Matthias J. Sax <matth...@confluent.io> > wrote: > > > > Thanks Jason! > > > > LGTM. > > > > On 8/21/19 3:07 PM, Jason Gustafson wrote: > > > Hi Matthias, > > > > > > Thanks, I appreciate the thorough review. I've revised the section to > make > > > the logic clearer. I think you have it right except for the 1). We only > > > generate a new PID if the epoch cannot be incremented without overflow. > > > > > > -Jason > > > > > > On Tue, Aug 20, 2019 at 5:45 PM Matthias J. Sax <matth...@confluent.io > > > > > wrote: > > > > > >> Thanks for the KIP. I just have some clarification questions to make > > >> sure I understand the proposal correctly: > > >> > > >> 1) "Safe Epoch Incrementing" > > >> > > >>> When the coordinator receives a new InitProducerId request, we will > use > > >> the following logic to update the epoch: > > >>> > > >>> 1. No epoch is provided: the current epoch will be bumped and the > last > > >> epoch will be set to -1. > > >>> 2.
Epoch and producerId are provided, and the provided producerId > > >> matches the current producerId or the provided producerId matches the > > >> previous producerId and the provided epoch is exhausted: > > >>> a. Provided epoch matches current epoch: the last epoch will be > > >> set to the current epoch, and the current epoch will be bumped. > > >>> b. Provided epoch matches last epoch: the current epoch will be > > >> returned > > >>> c. Else: return INVALID_PRODUCER_EPOCH > > >>> 3. Otherwise, return INVALID_PRODUCER_EPOCH > > >> > > >> Case (1) would be for a new producer. Hence, should we state that "no > > >> PID" is provided (instead of "no epoch" is provided?). That might be > > >> clearer and it implies that there is no epoch anyway. > > >> > > >> Case (2) would be for a producer that received `UNKNOWN_PRODUCER_ID` > > >> error and tries to re-initialize itself. > > >> > > >> Case (2a) implies that the producer sent its first request and is not > > >> fenced. Case (2b) implies that the producer retries to re-initialize > > >> itself, i.e., its first request to re-initialize did not get a response but > > >> was processed by the transaction coordinator. Case (2c) implies that a > > >> producer was fenced (similar to case 3, even if I am not sure what case 3 > > >> actually would be?) > > >> > > >> Please let me know if my understanding is correct. > > >> > > >> What is still unclear to me is, why case (2 -- or is it only 2b?) > > >> requires that the "provided epoch is exhausted"? > > >> > > >> For case 2b: > > >> > > >> Assume a producer with PID=100 and epoch=MAX_EPOCH that gets an > > >> `UNKNOWN_PRODUCER_ID`. It sends initPidRequest with the corresponding > > >> PID/epoch pair. The TC processes the request and creates a new PID=101 > > >> with new epoch=0, however, the response to the producer is lost. The TC > > >> still stores `currentPid=101`, `currentEpoch=0` and `previousPid=100`, > > >> `previousEpoch=MAX_EPOCH`.
When the producer re-tries, the sent > > >> PID/epoch still matches the previous PID/epoch pair and hence the TC > > >> knows it's a retry? > > >> > > >> If this reasoning is correct, should the logic be as follows: > > >> > > >> 1. No PID is provided: create a new PID with epoch=0 and set the last > > >> epoch to -1. > > >> 2. Epoch and producerId are provided > > >> a) the provided producerId/epoch matches the current > producerId/epoch: > > >> i) if the epoch is not exhausted, bump the epoch > > >> ii) if the epoch is exhausted, create a new PID with epoch=0 > > >> b) the provided producerId/epoch matches the previous > > >> producerId/epoch: respond with current PID/epoch > > >> c) Otherwise, return INVALID_PRODUCER_EPOCH > > >> > > >> > > >> > > >> -Matthias > > >> > > >> > > >> > > >> > > >> On 4/4/19 3:47 PM, Jason Gustafson wrote: > > >>> Hi Everyone, > > >>> > > >>> Sorry for the long delay on this KIP. I have updated it to include > the > > >>> handling of INVALID_PRODUCER_ID_MAPPING as suggested above. If there > are > > >> no > > >>> further comments, I will plan to start a vote early next week. > > >>> > > >>> Thanks! > > >>> Jason > > >>> > > >>> On Mon, Mar 25, 2019 at 2:33 PM Adam Bellemare < adam.bellem...@gmail.com > > >>> > > >>> wrote: > > >>> > > >>>> Ach - Sorry. I meant Jason. I had just read a John Roesler email. > > >>>> > > >>>> On Mon, Mar 25, 2019 at 5:21 PM Adam Bellemare < > > >> adam.bellem...@gmail.com> > > >>>> wrote: > > >>>> > > >>>>> Hi John > > >>>>> > > >>>>> What is the status of this KIP? > > >>>>> > > >>>>> My teammates and I are running into the "UNKNOWN_PRODUCER_ID" error > on > > >>>>> 2.1.1 for a multitude of our internal topics, and I suspect that a > > >> proper > > >>>>> fix is needed. > > >>>>> > > >>>>> Adam > > >>>>> > > >>>>> On Mon, Jan 7, 2019 at 7:42 PM Guozhang Wang <wangg...@gmail.com> > > >> wrote: > > >>>>> > > >>>>>> Thanks Jason. The proposed solution sounds good to me.
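For readers following along, the coordinator-side epoch-update logic proposed earlier in this thread (new PID for a fresh producer, epoch bump with overflow handling, and retry detection via the previous PID/epoch pair) could be sketched roughly as below. All class, field, and method names here are illustrative only, not Kafka's actual implementation:

```java
// Sketch of the proposed InitProducerId epoch-update logic. Hypothetical
// names throughout; returns {pid, epoch}, or null for INVALID_PRODUCER_EPOCH.
public class InitProducerIdSketch {
    static final short MAX_EPOCH = Short.MAX_VALUE;

    private long nextPid = 100;   // stand-in for the coordinator's PID allocator
    long currentPid = -1;
    short currentEpoch = -1;
    long previousPid = -1;
    short previousEpoch = -1;

    // Assumes epoch is provided whenever pid is.
    public long[] handleInitProducerId(Long pid, Short epoch) {
        if (pid == null) {
            // 1. New producer: fresh PID with epoch=0, last epoch set to -1.
            currentPid = nextPid++;
            currentEpoch = 0;
            previousPid = -1;
            previousEpoch = -1;
        } else if (pid == currentPid && epoch == currentEpoch) {
            // 2a. Matches the current pair: remember it for retries, then
            // bump the epoch, or allocate a new PID if the epoch is exhausted.
            previousPid = currentPid;
            previousEpoch = currentEpoch;
            if (currentEpoch < MAX_EPOCH) {
                currentEpoch++;           // 2a-i: safe to bump
            } else {
                currentPid = nextPid++;   // 2a-ii: epoch exhausted
                currentEpoch = 0;
            }
        } else if (pid == previousPid && epoch == previousEpoch) {
            // 2b. Retry whose original response was lost: return the current
            // pair without bumping again.
        } else {
            // 2c/3. Anything else: the producer has been fenced.
            return null;
        }
        return new long[]{currentPid, currentEpoch};
    }

    public static void main(String[] args) {
        InitProducerIdSketch tc = new InitProducerIdSketch();
        long[] first = tc.handleInitProducerId(null, null);               // new producer
        long[] bumped = tc.handleInitProducerId(first[0], (short) first[1]);
        long[] retried = tc.handleInitProducerId(first[0], (short) first[1]);
        System.out.println(bumped[1] == retried[1]);                      // retry is idempotent
        System.out.println(tc.handleInitProducerId(first[0], (short) 99) == null);
    }
}
```

Note how the retry in case 2b is idempotent: the second InitProducerId with the old pair returns the already-bumped current pair rather than bumping twice.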
> > >>>>>> > > >>>>>> > > >>>>>> Guozhang > > >>>>>> > > >>>>>> On Mon, Jan 7, 2019 at 3:52 PM Jason Gustafson < > ja...@confluent.io> > > >>>>>> wrote: > > >>>>>> > > >>>>>>> Hey Guozhang, > > >>>>>>> > > >>>>>>> Thanks for sharing the article. The INVALID_PRODUCER_ID_MAPPING > error > > >>>>>>> occurs following expiration of the producerId. It's possible that > > >>>>>> another > > >>>>>>> producerId has been installed in its place following expiration > (if > > >>>>>> another > > >>>>>>> producer instance has become active), or the mapping is empty. We > can > > >>>>>>> safely retry the InitProducerId with the logic in this KIP in > order > > >> to > > >>>>>>> detect which case it is. So I'd suggest something like this: > > >>>>>>> > > >>>>>>> 1. After receiving INVALID_PRODUCER_ID_MAPPING, the producer can > send > > >>>>>>> InitProducerId using the current producerId and epoch. > > >>>>>>> 2. If no mapping exists, the coordinator can generate a new > > >> producerId > > >>>>>> and > > >>>>>>> return it. If a transaction is in progress on the client, it will > > >> have > > >>>>>> to > > >>>>>>> be aborted, but the producer can continue afterwards. > > >>>>>>> 3. Otherwise if a different producerId has been assigned, then we > can > > >>>>>>> return INVALID_PRODUCER_ID_MAPPING. To simplify error handling, > we > > >> can > > >>>>>>> probably raise this as ProducerFencedException since that is > > >>>> effectively > > >>>>>>> what has happened. Ideally this is the only fatal case that users > > >> have > > >>>>>> to > > >>>>>>> handle. > > >>>>>>> > > >>>>>>> I'll give it a little more thought and update the KIP. > > >>>>>>> > > >>>>>>> Thanks, > > >>>>>>> Jason > > >>>>>>> > > >>>>>>> On Thu, Jan 3, 2019 at 1:38 PM Guozhang Wang <wangg...@gmail.com > > > > >>>>>> wrote: > > >>>>>>> > > >>>>>>>> You're right about the dangling txn since it will actually block > > >>>>>>>> read-committed consumers from proceeding at all. 
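The three-step INVALID_PRODUCER_ID_MAPPING handling Jason outlines above (retry InitProducerId with the current producerId; generate a new producerId if no mapping exists; otherwise raise ProducerFencedException) could be sketched as follows. The class and method names are hypothetical, not Kafka's actual code:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the coordinator-side decision for an InitProducerId retry sent
// after INVALID_PRODUCER_ID_MAPPING. Hypothetical names throughout.
public class PidMappingRetrySketch {
    enum Resolution { NEW_PRODUCER_ID, FENCED }

    // transactional.id -> currently mapped producerId (absent once expired)
    private final Map<String, Long> pidMapping = new HashMap<>();

    void install(String txnId, long pid) { pidMapping.put(txnId, pid); }

    /** Steps 2/3: decide how to answer a retry carrying the old producerId. */
    Resolution resolveRetry(String txnId, long oldPid) {
        Long mapped = pidMapping.get(txnId);
        if (mapped == null) {
            // Step 2: mapping expired with no successor. Generate a new
            // producerId; the client aborts any in-flight txn and continues.
            return Resolution.NEW_PRODUCER_ID;
        }
        if (mapped != oldPid) {
            // Step 3: another producerId was installed, i.e. another instance
            // took over. Surface this as ProducerFencedException to the client.
            return Resolution.FENCED;
        }
        // Mapping unchanged: handle like a normal InitProducerId (not shown).
        return Resolution.NEW_PRODUCER_ID;
    }

    public static void main(String[] args) {
        PidMappingRetrySketch tc = new PidMappingRetrySketch();
        System.out.println(tc.resolveRetry("app-1", 100)); // expired mapping
        tc.install("app-1", 200);                          // another instance took over
        System.out.println(tc.resolveRetry("app-1", 100)); // old producer is fenced
    }
}
```

This matches the goal stated above: fencing becomes ideally the only fatal case users must handle, while an expired mapping is recoverable after aborting the current transaction.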
I'd agree that > > >>>> since > > >>>>>>> this > > >>>>>>>> is a very rare case, we can consider fixing it not via > broker-side > > >>>>>> logic > > >>>>>>>> but via tooling in future work. > > >>>>>>>> > > >>>>>>>> I've also discovered some related error handling logic inside > > >>>> producer > > >>>>>>> that > > >>>>>>>> may be addressed together with this KIP (since it is mostly for > > >>>>>> internal > > >>>>>>>> implementations, the wiki itself does not need to be modified): > > >>>>>>>> > > >>>>>>>> > > >>>>>>>> > > >>>>>>> > > >>>>>> > > >>>> > > >> > > https://stackoverflow.com/questions/53976117/why-did-the-kafka-stream-fail-to-produce-data-after-a-long-time/54029181#54029181 > > >>>>>>>> > > >>>>>>>> Guozhang > > >>>>>>>> > > >>>>>>>> > > >>>>>>>> > > >>>>>>>> On Thu, Nov 29, 2018 at 2:25 PM Jason Gustafson < ja...@confluent.io > > >>>>> > > >>>>>>>> wrote: > > >>>>>>>> > > >>>>>>>>> Hey Guozhang, > > >>>>>>>>> > > >>>>>>>>> To clarify, the broker does not actually use the ApiVersion API > > >>>> for > > >>>>>>>>> inter-broker communications. The use of an API and its > > >>>> corresponding > > >>>>>>>>> version is controlled by `inter.broker.protocol.version`. > > >>>>>>>>> > > >>>>>>>>> Nevertheless, it sounds like we're on the same page about > removing > > >>>>>>>>> DescribeTransactionState. The impact of a dangling transaction > is > > >>>> a > > >>>>>>>> little > > >>>>>>>>> worse than what you describe though. Consumers with the > > >>>>>> read_committed > > >>>>>>>>> isolation level will be stuck. Still, I think we agree that > this > > >>>>>> case > > >>>>>>>>> should be rare and we can reconsider for future work. Rather > than > > >>>>>>>>> preventing dangling transactions, perhaps we should consider > > >>>> options > > >>>>>>>> which > > >>>>>>>>> allow us to detect them and recover. Anyway, this needs more > > >>>>>> thought. > > >>>>>>> I > > >>>>>>>>> will update the KIP.
> > >>>>>>>>> > > >>>>>>>>> Best, > > >>>>>>>>> Jason > > >>>>>>>>> > > >>>>>>>>> On Tue, Nov 27, 2018 at 6:51 PM Guozhang Wang < wangg...@gmail.com > > >>>>> > > >>>>>>>> wrote: > > >>>>>>>>> > > >>>>>>>>>> 0. My original question is about the implementation details > > >>>>>>> primarily, > > >>>>>>>>>> since currently the handling logic of the APIVersionResponse is > > >>>>>> simply > > >>>>>>>> "use > > >>>>>>>>>> the highest supported version of the corresponding request", but > > >>>>>> if > > >>>>>>> the > > >>>>>>>>>> returned response from APIVersionRequest says "I don't even > know > > >>>>>>> about > > >>>>>>>>> the > > >>>>>>>>>> DescribeTransactionStateRequest at all", then we need > additional > > >>>>>>> logic > > >>>>>>>>> for > > >>>>>>>>>> the fallback logic. Currently this logic is embedded in > > >>>>>>>> NetworkClient > > >>>>>>>>>> which is shared by all clients, so I'd like to avoid making > this > > >>>>>>> logic > > >>>>>>>>> more > > >>>>>>>>>> complicated. > > >>>>>>>>>> > > >>>>>>>>>> As for the general issue that a broker does not recognize a > > >>>>>> producer > > >>>>>>>> with > > >>>>>>>>>> sequence number 0, here's my thinking: as you mentioned in the > > >>>>>> wiki, > > >>>>>>>> this > > >>>>>>>>>> is only a concern for transactional producer since for > > >>>> idempotent > > >>>>>>>>> producer > > >>>>>>>>>> it can just bump the epoch and go. For transactional producer, > > >>>>>> even > > >>>>>>> if > > >>>>>>>>> the > > >>>>>>>>>> producer request from a fenced producer gets accepted, its > > >>>>>>> transaction > > >>>>>>>>> will > > >>>>>>>>>> never be committed and hence messages not exposed to > > >>>>>> read-committed > > >>>>>>>>>> consumers as well.
The drawbacks, though, are: 1) read-uncommitted > > >>>>>>>> consumers > > >>>>>>>>>> will still read those messages, 2) unnecessary storage for > those > > >>>>>>> fenced > > >>>>>>>>>> produce messages, but in practice should not accumulate to a > > >>>> large > > >>>>>>>> amount > > >>>>>>>>>> since the producer should soon try to commit and be told it is > > >>>> fenced > > >>>>>> and > > >>>>>>>>> then > > >>>>>>>>>> stop, 3) there will be no markers for those transactional > > >>>> messages > > >>>>>>>> ever. > > >>>>>>>>>> Looking at the list and thinking about the likelihood it may > > >>>>>> happen > > >>>>>>>>>> assuming we retain the producer up to transactional.id.timeout > > >>>>>>> (default > > >>>>>>>>> is > > >>>>>>>>>> 7 days), I feel comfortable leaving it as is. > > >>>>>>>>>> > > >>>>>>>>>> Guozhang > > >>>>>>>>>> > > >>>>>>>>>> On Mon, Nov 26, 2018 at 6:09 PM Jason Gustafson < > > >>>>>> ja...@confluent.io> > > >>>>>>>>>> wrote: > > >>>>>>>>>> > > >>>>>>>>>>> Hey Guozhang, > > >>>>>>>>>>> > > >>>>>>>>>>> Thanks for the comments. Responses below: > > >>>>>>>>>>> > > >>>>>>>>>>> 0. The new API is used between brokers, so we govern its > usage > > >>>>>>> using > > >>>>>>>>>>> `inter.broker.protocol.version`. If the other broker hasn't > > >>>>>>> upgraded, > > >>>>>>>>> we > > >>>>>>>>>>> will just fall back to the old logic, which is to accept the > > >>>>>> write. > > >>>>>>>> This > > >>>>>>>>>> is > > >>>>>>>>>>> similar to how we introduced the OffsetsForLeaderEpoch API. > > >>>> Does > > >>>>>>> that > > >>>>>>>>>> seem > > >>>>>>>>>>> reasonable? > > >>>>>>>>>>> > > >>>>>>>>>>> To tell the truth, after digging this KIP up and reading it > > >>>>>> over, I > > >>>>>>>> am > > >>>>>>>>>>> doubting how crucial this API is.
It is attempting to protect > > >>>> a > > >>>>>>> write > > >>>>>>>>>> from > > >>>>>>>>>>> a zombie which has just reset its sequence number after that > > >>>>>>> producer > > >>>>>>>>> had > > >>>>>>>>>>> had its state cleaned up. However, one of the other > > >>>>>> improvements in > > >>>>>>>>> this > > >>>>>>>>>>> KIP is to maintain producer state beyond its retention in the > > >>>>>> log. > > >>>>>>> I > > >>>>>>>>>> think > > >>>>>>>>>>> that makes this case sufficiently unlikely that we can leave > > >>>> it > > >>>>>> for > > >>>>>>>>>> future > > >>>>>>>>>>> work. I am not 100% sure this is the only scenario where > > >>>>>>> transaction > > >>>>>>>>>> state > > >>>>>>>>>>> and log state can diverge anyway, so it would be better to > > >>>>>> consider > > >>>>>>>>> this > > >>>>>>>>>>> problem more generally. What do you think? > > >>>>>>>>>>> > > >>>>>>>>>>> 1. Thanks, from memory, the term changed after the first > > >>>>>> iteration. > > >>>>>>>>> I'll > > >>>>>>>>>>> make a pass and try to clarify usage. > > >>>>>>>>>>> 2. I was attempting to handle some edge cases since this > check > > >>>>>>> would > > >>>>>>>> be > > >>>>>>>>>>> asynchronous. In any case, if we drop this validation as > > >>>>>> suggested > > >>>>>>>>> above, > > >>>>>>>>>>> then we can ignore this. > > >>>>>>>>>>> > > >>>>>>>>>>> -Jason > > >>>>>>>>>>> > > >>>>>>>>>>> > > >>>>>>>>>>> > > >>>>>>>>>>> On Tue, Nov 13, 2018 at 6:23 PM Guozhang Wang < > > >>>>>> wangg...@gmail.com> > > >>>>>>>>>> wrote: > > >>>>>>>>>>> > > >>>>>>>>>>>> Hello Jason, thanks for the great write-up. > > >>>>>>>>>>>> > > >>>>>>>>>>>> 0. One question about the migration plan: "The new > > >>>>>>>>> GetTransactionState > > >>>>>>>>>>> API > > >>>>>>>>>>>> and the new version of the transaction state message will > > >>>> not > > >>>>>> be > > >>>>>>>> used > > >>>>>>>>>>> until > > >>>>>>>>>>>> the inter-broker version supports it." 
I'm not so clear > > >>>> about > > >>>>>> the > > >>>>>>>>>>>> implementation details here: say a broker is on the newer > > >>>>>> version > > >>>>>>>> and > > >>>>>>>>>> the > > >>>>>>>>>>>> txn-coordinator is still on older version. Today the > > >>>>>>>>> APIVersionsRequest > > >>>>>>>>>>> can > > >>>>>>>>>>>> only help upgrade / downgrade the request version, but not > > >>>>>>>> forbidding > > >>>>>>>>>>>> sending any. Are you suggesting we add additional logic on > > >>>> the > > >>>>>>>> broker > > >>>>>>>>>>> side > > >>>>>>>>>>>> to handle the case of "not sending the request"? If yes my > > >>>>>>> concern > > >>>>>>>> is > > >>>>>>>>>>> that > > >>>>>>>>>>>> this will be some tech-debt code that will live long before > > >>>>>> being > > >>>>>>>>>>> removed. > > >>>>>>>>>>>> > > >>>>>>>>>>>> Some additional minor comments: > > >>>>>>>>>>>> > > >>>>>>>>>>>> 1. "last epoch" and "instance epoch" seem to be referring to > > >>>>>> the > > >>>>>>>> same > > >>>>>>>>>>> thing > > >>>>>>>>>>>> in your wiki. > > >>>>>>>>>>>> 2. "The broker must verify after receiving the response that > > >>>>>> the > > >>>>>>>>>> producer > > >>>>>>>>>>>> state is still unknown.": not sure why we have to validate? > > >>>> If > > >>>>>>> the > > >>>>>>>>>>> metadata > > >>>>>>>>>>>> returned from the txn-coordinator can always be considered > > >>>> the > > >>>>>>>>>>>> source-of-truth, can't we just blindly use it to update the > > >>>>>> cache? > > >>>>>>>>>>>> > > >>>>>>>>>>>> > > >>>>>>>>>>>> Guozhang > > >>>>>>>>>>>> > > >>>>>>>>>>>> > > >>>>>>>>>>>> > > >>>>>>>>>>>> On Thu, Sep 6, 2018 at 9:10 PM Matthias J. Sax < > > >>>>>>>>> matth...@confluent.io> > > >>>>>>>>>>>> wrote: > > >>>>>>>>>>>>> > > >>>>>>>>>>>>> I am +1 on this :) > > >>>>>>>>>>>>> > > >>>>>>>>>>>>> > > >>>>>>>>>>>>> -Matthias > > >>>>>>>>>>>>> > > >>>>>>>>>>>>> On 9/4/18 9:55 AM, Jason Gustafson wrote: > > >>>>>>>>>>>>>> Bump.
Thanks to Magnus for noticing that I forgot to > > >>>> link > > >>>>>> to > > >>>>>>>> the > > >>>>>>>>>> KIP: > > >>>>>>>>>>>>>> > > >>>>>>>>>>>>> > > >>>>>>>>>>>> > > >>>>>>>>>>> > > >>>>>>>>>> > > >>>>>>>>> > > >>>>>>>> > > >>>>>>> > > >>>>>> > > >>>> > > >> > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-360%3A+Improve+handling+of+unknown+producer > > >>>>>>>>>>>>>> . > > >>>>>>>>>>>>>> > > >>>>>>>>>>>>>> -Jason > > >>>>>>>>>>>>>> > > >>>>>>>>>>>>>> On Tue, Aug 21, 2018 at 4:37 PM, Jason Gustafson < > > >>>>>>>>>> ja...@confluent.io > > >>>>>>>>>>>> > > >>>>>>>>>>>>> wrote: > > >>>>>>>>>>>>>> > > >>>>>>>>>>>>>>> Hi All, > > >>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>> I have a proposal to improve the > > >>>> transactional/idempotent > > >>>>>>>>>> producer's > > >>>>>>>>>>>>>>> handling of the UNKNOWN_PRODUCER error, which is the > > >>>>>> result > > >>>>>>> of > > >>>>>>>>>>> losing > > >>>>>>>>>>>>>>> producer state following segment removal. The current > > >>>>>>> behavior > > >>>>>>>>> is > > >>>>>>>>>>> both > > >>>>>>>>>>>>>>> complex and limited. Please take a look and let me know > > >>>>>> what > > >>>>>>>> you > > >>>>>>>>>>>> think. > > >>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>> Thanks in advance to Matthias Sax for feedback on the > > >>>>>>> initial > > >>>>>>>>>> draft. > > >>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>> -Jason > > >>>>>>>>>>>>>>> > > >>>>>>>>>>>>>> > > >>>>>>>>>>>>> > > >>>>>>>>>>>>> > > >>>>>>>>>>>> > > >>>>>>>>>>>> -- > > >>>>>>>>>>>> -- Guozhang > > >>>>>>>>>>>> > > >>>>>>>>>>> > > >>>>>>>>>> > > >>>>>>>>>> > > >>>>>>>>>> -- > > >>>>>>>>>> -- Guozhang > > >>>>>>>>>> > > >>>>>>>>> > > >>>>>>>> > > >>>>>>>> > > >>>>>>>> -- > > >>>>>>>> -- Guozhang > > >>>>>>>> > > >>>>>>> > > >>>>>> > > >>>>>> > > >>>>>> -- > > >>>>>> -- Guozhang > > >>>>>> > > >>>>> > > >>>> > > >>> > > >> > > >> > > > > > > > > -- > -- Guozhang >