Re: [DISCUSS] KIP-360: Improve handling of unknown producer

Matthias J. Sax Tue, 09 Apr 2019 08:57:12 -0700

@Adam:

As workaround, you can increase the repartition topic config
`segment.bytes` and set a larger segment size. This should mitigate the
issue.



-Matthias



On 4/4/19 3:47 PM, Jason Gustafson wrote:
> Hi Everyone,
> 
> Sorry for the long delay on this KIP. I have updated it to include the
> handling of INVALID_PRODUCER_ID_MAPPING as suggested above. If there are no
> further comments, I will plan to start a vote early next week.
> 
> Thanks!
> Jason
> 
> On Mon, Mar 25, 2019 at 2:33 PM Adam Bellemare <adam.bellem...@gmail.com>
> wrote:
> 
>> Ach - Sorry. I meant Jason. I had just read a John Roesler email.
>>
>> On Mon, Mar 25, 2019 at 5:21 PM Adam Bellemare <adam.bellem...@gmail.com>
>> wrote:
>>
>>> Hi John
>>>
>>> What is the status of this KIP?
>>>
>>> My teammates and I are running into the "UNKNOWN_PRODUCER_ID" error on
>>> 2.1.1 for a multitude of our internal topics, and I suspect that a proper
>>> fix is needed.
>>>
>>> Adam
>>>
>>> On Mon, Jan 7, 2019 at 7:42 PM Guozhang Wang <wangg...@gmail.com> wrote:
>>>
>>>> Thanks Jason. The proposed solution sounds good to me.
>>>>
>>>>
>>>> Guozhang
>>>>
>>>> On Mon, Jan 7, 2019 at 3:52 PM Jason Gustafson <ja...@confluent.io>
>>>> wrote:
>>>>
>>>>> Hey Guozhang,
>>>>>
>>>>> Thanks for sharing the article. The INVALID_PRODUCER_ID_MAPPING error
>>>>> occurs following expiration of the producerId. It's possible that
>>>> another
>>>>> producerId has been installed in its place following expiration (if
>>>> another
>>>>> producer instance has become active), or the mapping is empty. We can
>>>>> safely retry the InitProducerId with the logic in this KIP in order to
>>>>> detect which case it is. So I'd suggest something like this:
>>>>>
>>>>> 1. After receiving INVALID_PRODUCER_ID_MAPPING, the producer can send
>>>>> InitProducerId using the current producerId and epoch.
>>>>> 2. If no mapping exists, the coordinator can generate a new producerId
>>>> and
>>>>> return it. If a transaction is in progress on the client, it will have
>>>> to
>>>>> be aborted, but the producer can continue afterwards.
>>>>> 3. Otherwise if a different producerId has been assigned, then we can
>>>>> return INVALID_PRODUCER_ID_MAPPING. To simplify error handling, we can
>>>>> probably raise this as ProducerFencedException since that is
>> effectively
>>>>> what has happened. Ideally this is the only fatal case that users have
>>>> to
>>>>> handle.
>>>>>
>>>>> I'll give it a little more thought and update the KIP.
>>>>>
>>>>> Thanks,
>>>>> Jason
>>>>>
>>>>> On Thu, Jan 3, 2019 at 1:38 PM Guozhang Wang <wangg...@gmail.com>
>>>> wrote:
>>>>>
>>>>>> You're right about the dangling txn since it will actually block
>>>>>> read-committed consumers from proceeding at all. I'd agree that
>> since
>>>>> this
>>>>>> is a very rare case, we can consider fixing it not via broker-side
>>>> logic
>>>>>> but via tooling in a future work.
>>>>>>
>>>>>> I've also discovered some related error handling logic inside
>> producer
>>>>> that
>>>>>> may be addressed together with this KIP (since it is mostly for
>>>> internal
>>>>>> implementations the wiki itself does not need to be modified):
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>
>> https://stackoverflow.com/questions/53976117/why-did-the-kafka-stream-fail-to-produce-data-after-a-long-time/54029181#54029181
>>>>>>
>>>>>> Guozhang
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Thu, Nov 29, 2018 at 2:25 PM Jason Gustafson <ja...@confluent.io
>>>
>>>>>> wrote:
>>>>>>
>>>>>>> Hey Guozhang,
>>>>>>>
>>>>>>> To clarify, the broker does not actually use the ApiVersion API
>> for
>>>>>>> inter-broker communications. The use of an API and its
>> corresponding
>>>>>>> version is controlled by `inter.broker.protocol.version`.
>>>>>>>
>>>>>>> Nevertheless, it sounds like we're on the same page about removing
>>>>>>> DescribeTransactionState. The impact of a dangling transaction is
>> a
>>>>>> little
>>>>>>> worse than what you describe though. Consumers with the
>>>> read_committed
>>>>>>> isolation level will be stuck. Still, I think we agree that this
>>>> case
>>>>>>> should be rare and we can reconsider for future work. Rather than
>>>>>>> preventing dangling transactions, perhaps we should consider
>> options
>>>>>> which
>>>>>>> allows us to detect them and recover. Anyway, this needs more
>>>> thought.
>>>>> I
>>>>>>> will update the KIP.
>>>>>>>
>>>>>>> Best,
>>>>>>> Jason
>>>>>>>
>>>>>>> On Tue, Nov 27, 2018 at 6:51 PM Guozhang Wang <wangg...@gmail.com
>>>
>>>>>> wrote:
>>>>>>>
>>>>>>>> 0. My original question is about the implementation details
>>>>> primarily,
>>>>>>>> since current the handling logic of the APIVersionResponse is
>>>> simply
>>>>>> "use
>>>>>>>> the highest supported version of the corresponding request", but
>>>> if
>>>>> the
>>>>>>>> returned response from APIVersionRequest says "I don't even know
>>>>> about
>>>>>>> the
>>>>>>>> DescribeTransactionStateRequest at all", then we need additional
>>>>> logic
>>>>>>> for
>>>>>>>> the falling back logic. Currently this logic is embedded in
>>>>>> NetworkClient
>>>>>>>> which is shared by all clients, so I'd like to avoid making this
>>>>> logic
>>>>>>> more
>>>>>>>> complicated.
>>>>>>>>
>>>>>>>> As for the general issue that a broker does not recognize a
>>>> producer
>>>>>> with
>>>>>>>> sequence number 0, here's my thinking: as you mentioned in the
>>>> wiki,
>>>>>> this
>>>>>>>> is only a concern for transactional producer since for
>> idempotent
>>>>>>> producer
>>>>>>>> it can just bump the epoch and go. For transactional producer,
>>>> even
>>>>> if
>>>>>>> the
>>>>>>>> producer request from a fenced producer gets accepted, its
>>>>> transaction
>>>>>>> will
>>>>>>>> never be committed and hence messages not exposed to
>>>> read-committed
>>>>>>>> consumers as well. The drawback is though, 1) read-uncommitted
>>>>>> consumers
>>>>>>>> will still read those messages, 2) unnecessary storage for those
>>>>> fenced
>>>>>>>> produce messages, but in practice should not accumulate to a
>> large
>>>>>> amount
>>>>>>>> since producer should soon try to commit and be told it is
>> fenced
>>>> and
>>>>>>> then
>>>>>>>> stop, 3) there will be no markers for those transactional
>> messages
>>>>>> ever.
>>>>>>>> Looking at the list and thinking about the likelihood it may
>>>> happen
>>>>>>>> assuming we retain the producer up to transactional.id.timeout
>>>>> (default
>>>>>>> is
>>>>>>>> 7 days), I feel comfortable leaving it as is.
>>>>>>>>
>>>>>>>> Guozhang
>>>>>>>>
>>>>>>>> On Mon, Nov 26, 2018 at 6:09 PM Jason Gustafson <
>>>> ja...@confluent.io>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Hey Guozhang,
>>>>>>>>>
>>>>>>>>> Thanks for the comments. Responses below:
>>>>>>>>>
>>>>>>>>> 0. The new API is used between brokers, so we govern its usage
>>>>> using
>>>>>>>>> `inter.broker.protocol.version`. If the other broker hasn't
>>>>> upgraded,
>>>>>>> we
>>>>>>>>> will just fallback to the old logic, which is to accept the
>>>> write.
>>>>>> This
>>>>>>>> is
>>>>>>>>> similar to how we introduced the OffsetsForLeaderEpoch API.
>> Does
>>>>> that
>>>>>>>> seem
>>>>>>>>> reasonable?
>>>>>>>>>
>>>>>>>>> To tell the truth, after digging this KIP up and reading it
>>>> over, I
>>>>>> am
>>>>>>>>> doubting how crucial this API is. It is attempting to protect
>> a
>>>>> write
>>>>>>>> from
>>>>>>>>> a zombie which has just reset its sequence number after that
>>>>> producer
>>>>>>> had
>>>>>>>>> had its state cleaned up. However, one of the other
>>>> improvements in
>>>>>>> this
>>>>>>>>> KIP is to maintain producer state beyond its retention in the
>>>> log.
>>>>> I
>>>>>>>> think
>>>>>>>>> that makes this case sufficiently unlikely that we can leave
>> it
>>>> for
>>>>>>>> future
>>>>>>>>> work. I am not 100% sure this is the only scenario where
>>>>> transaction
>>>>>>>> state
>>>>>>>>> and log state can diverge anyway, so it would be better to
>>>> consider
>>>>>>> this
>>>>>>>>> problem more generally. What do you think?
>>>>>>>>>
>>>>>>>>> 1. Thanks, from memory, the term changed after the first
>>>> iteration.
>>>>>>> I'll
>>>>>>>>> make a pass and try to clarify usage.
>>>>>>>>> 2. I was attempting to handle some edge cases since this check
>>>>> would
>>>>>> be
>>>>>>>>> asynchronous. In any case, if we drop this validation as
>>>> suggested
>>>>>>> above,
>>>>>>>>> then we can ignore this.
>>>>>>>>>
>>>>>>>>> -Jason
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Tue, Nov 13, 2018 at 6:23 PM Guozhang Wang <
>>>> wangg...@gmail.com>
>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Hello Jason, thanks for the great write-up.
>>>>>>>>>>
>>>>>>>>>> 0. One question about the migration plan: "The new
>>>>>>> GetTransactionState
>>>>>>>>> API
>>>>>>>>>> and the new version of the transaction state message will
>> not
>>>> be
>>>>>> used
>>>>>>>>> until
>>>>>>>>>> the inter-broker version supports it." I'm not so clear
>> about
>>>> the
>>>>>>>>>> implementation details here: say a broker is on the newer
>>>> version
>>>>>> and
>>>>>>>> the
>>>>>>>>>> txn-coordinator is still on older version. Today the
>>>>>>> APIVersionsRequest
>>>>>>>>> can
>>>>>>>>>> only help upgrade / downgrade the request version, but not
>>>>>> forbidding
>>>>>>>>>> sending any. Are you suggesting we add additional logic on
>> the
>>>>>> broker
>>>>>>>>> side
>>>>>>>>>> to handle the case of "not sending the request"? If yes my
>>>>> concern
>>>>>> is
>>>>>>>>> that
>>>>>>>>>> this will be some tech-debt code that will live long before
>>>> being
>>>>>>>>> removed.
>>>>>>>>>>
>>>>>>>>>> Some additional minor comments:
>>>>>>>>>>
>>>>>>>>>> 1. "last epoch" and "instance epoch" seem to be referring to
>>>> the
>>>>>> same
>>>>>>>>> thing
>>>>>>>>>> in your wiki.
>>>>>>>>>> 2. "The broker must verify after receiving the response that
>>>> the
>>>>>>>> producer
>>>>>>>>>> state is still unknown.": not sure why we have to validate?
>> If
>>>>> the
>>>>>>>>> metadata
>>>>>>>>>> returned from the txn-coordinator can always be considered
>> the
>>>>>>>>>> source-of-truth, can't we just bindly use it to update the
>>>> cache?
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Guozhang
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Thu, Sep 6, 2018 at 9:10 PM Matthias J. Sax <
>>>>>>> matth...@confluent.io>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> I am +1 on this :)
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> -Matthias
>>>>>>>>>>>
>>>>>>>>>>> On 9/4/18 9:55 AM, Jason Gustafson wrote:
>>>>>>>>>>>> Bump. Thanks to Magnus for noticing that I forgot to
>> link
>>>> to
>>>>>> the
>>>>>>>> KIP:
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-360%3A+Improve+handling+of+unknown+producer
>>>>>>>>>>>> .
>>>>>>>>>>>>
>>>>>>>>>>>> -Jason
>>>>>>>>>>>>
>>>>>>>>>>>> On Tue, Aug 21, 2018 at 4:37 PM, Jason Gustafson <
>>>>>>>> ja...@confluent.io
>>>>>>>>>>
>>>>>>>>>>> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi All,
>>>>>>>>>>>>>
>>>>>>>>>>>>> I have a proposal to improve the
>> transactional/idempotent
>>>>>>>> producer's
>>>>>>>>>>>>> handling of the UNKNOWN_PRODUCER error, which is the
>>>> result
>>>>> of
>>>>>>>>> losing
>>>>>>>>>>>>> producer state following segment removal. The current
>>>>> behavior
>>>>>>> is
>>>>>>>>> both
>>>>>>>>>>>>> complex and limited. Please take a look and let me know
>>>> what
>>>>>> you
>>>>>>>>>> think.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks in advance to Matthias Sax for feedback on the
>>>>> initial
>>>>>>>> draft.
>>>>>>>>>>>>>
>>>>>>>>>>>>> -Jason
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> -- Guozhang
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> -- Guozhang
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> -- Guozhang
>>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> -- Guozhang
>>>>
>>>
>>
>

signature.asc
Description: OpenPGP digital signature

Re: [DISCUSS] KIP-360: Improve handling of unknown producer

Reply via email to