Can we update the KIP to clearly document these decisions?

Thanks,

Justine

On Tue, Jul 9, 2024 at 9:25 AM Andrew Schofield <andrew_schofi...@live.com>
wrote:

> Hi Chris,
> As it stands, the error handling for transactions in KafkaProducer is not
> ideal. There’s no reason why a failed operation should fail a transaction
> provided that the application can tell that the operation was not included
> in the transaction and then make its own decision whether to continue or
> back out. So, I think I disagree with the original premise of a client-side
> error state for a transaction, but we are where we are.
>
> When I voted, I did not expect the KIP to handle ALL errors which could
> conceivably be handled. I did expect it to handle client-side send errors
> that would cause a record to be rejected from a batch before sending to a
> broker. I think that it does make the KafkaProducer interface very slightly
> more complicated, but the new option is a clear improvement and I
> don’t see anyone getting into a mess using it.
>
> I think broker-side errors are more tricky and I don’t think an overload
> on the send() method is going to do the job. I don’t see that as a problem
> with the KIP, just that the underlying RPCs and behaviour is not very
> amenable to record-specific error handling. The Produce RPC is a
> complicated beast which can include a set of records for mutiple
> topic-partitions. Although ProduceResponse v10 does include record
> errors, I don’t believe this is surfaced in the client. Let’s imagine
> something
> like broker-side record validation which barfs on one record. Failing an
> entire batch is easier, but less useful if the problem is related to one
> record.
>
> In summary, I’m happy that my vote stands, and I am happy with the KIP
> only supporting client-side record errors.
>
> Thanks,
> Andrew
>
> > On 8 Jul 2024, at 16:37, Chris Egerton <chr...@aiven.io.INVALID> wrote:
> >
> > Hi Alieh,
> >
> > Can you clarify why broker-side errors shouldn't be covered? The only
> real
> > rationale I can come up with is that it's easier to implement.
> >
> > "Things were better for Kafka Streams before KAFKA-9279 was fixed" isn't
> > very convincing, because Kafka Streams is not the only user of the Java
> > producer client. And for others, especially new users, I doubt that this
> > new API we're proposing would make sense without having to consult a lot
> of
> > historical context.
> >
> > I also don't think that most users will know or even care about the
> > distinction between errors that cause a record to fail before it's added
> to
> > a batch vs. after. If you were writing a producer application of your
> own,
> > and you wanted to handle RecordTooLargeException instances by dropping a
> > record without aborting a transaction, would you care about whether it
> was
> > your client or your broker that balked? Would you be happy if you wrote
> > logic expecting that that problem was solved once and for all, only to
> > learn that it could still affect you in other circumstances? Or,
> > alternatively, would you be happy if you wanted to solve that problem and
> > found an API that seemed to do exactly what you wanted, but after reading
> > the fine print, realized you'd have to do it yourself instead?
> >
> > Ultimately, the more I think about this, the more I believe that we're
> > adding noise to the API (with the new overloaded variant of send) for a
> > feature that will likely bring confusion and even frustration to anyone
> > besides maintainers of Kafka Streams who tries to use it.
> >
> > If the only concern about covering broker-side errors is that it would be
> > more difficult to implement, I believe we should strongly reconsider that
> > alternative. That said, if there is a straightforward way to explain this
> > feature to new users that won't mislead them or require them to do
> research
> > on producer internals, then I can still live with it.
> >
> > Regarding a list of recoverable vs. irrecoverable errors, this is
> actually
> > the subject of another recently-introduced KIP:
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-1050%3A+Consistent+error+handling+for+Transactions
> >
> > Finally, I'd also like to ask the people who have already voted (Andrew,
> > Matthias) if, at the time they voted, they believed that the API would
> > handle all errors, or only the subset of errors that would cause a record
> > to be rejected from a batch before it can be sent to a broker.
> >
> > Best,
> >
> > Chris
> >
> > On Thu, Jul 4, 2024 at 12:43 PM Alieh Saeedi
> <asae...@confluent.io.invalid>
> > wrote:
> >
> >> Salut from the KIP’s author
> >>
> >>
> >> Clarifying two points:
> >>
> >>
> >> 1) broker side errors:
> >>
> >> As far as I remember we are not going to cover the errors originating
> from
> >> the broker!
> >>
> >> A historical fact: One of the debate points in KIP-1038 was that by
> >> defining a producer custom handler, the user may assume that broker-side
> >> errors must be covered as well. They may define a handler for handling
> >> `RecordTooLargeException` and still see such errors not being handled as
> >> they wish.
> >>
> >>
> >> 2) Regarding irrecoverable/recoverable errors:
> >>
> >> Before the fix of `KAFKA-9279`,  errors such as
> `RecordTooLargeException`
> >> or errors related to missing meta data (both originating from Producer
> >> `send()`) were considered as recoverable but after that they turned into
> >> being irrecoverable without changing any Javadocs or having any KIP.
> All
> >> the effort made in this KIP and the former one have been towards
> returning
> >> to the former state.
> >>
> >>
> >> I am sure that it is clear for you that which sort of errors we are
> going
> >> to cover: A single record may happen to NOT get added to the batch due
> to
> >> the issues with the record or its corresponding topic. The point was
> that
> >> if the record is not added to the batch let ’s don’t fail the whole
> batch
> >> because of that non-existing record. We never intended to do sth in
> broker
> >> side or ignore more important errors.  But I agree with you Chris. If we
> >> are adding a new API, we must have good documentation for that. The
> >> sentence `all irrecoverable transactional errors will still be fatal` as
> >> you suggested is good. What do you think? I am totally against
> enumerating
> >> errors in Javadocs since these sort of errors can be changing during
> >> time.  More
> >> over, have you ever seen any list of recoverable or irrecoverable errors
> >> somewhere so far?
> >>
> >>
> >> Bests,
> >>
> >> Alieh
> >>
> >> On Wed, Jul 3, 2024 at 6:07 PM Chris Egerton <chr...@aiven.io.invalid>
> >> wrote:
> >>
> >>> Hi Justine,
> >>>
> >>> I agree that enumerating a list of errors that should be covered by the
> >> KIP
> >>> is difficult; I was thinking it might be easier if we list the errors
> >> that
> >>> should _not_ be covered by the KIP, and only if we can't define a
> >>> reasonable heuristic that would cover them without having to explicitly
> >>> list them. Could it be enough to say "all irrecoverable transactional
> >>> errors will still be fatal", or even just "all transactional errors (as
> >>> opposed to errors related to this specific record) will still be
> fatal"?
> >>>
> >>> Cheers,
> >>>
> >>> Chris
> >>>
> >>> On Wed, Jul 3, 2024 at 11:56 AM Justine Olshan
> >>> <jols...@confluent.io.invalid>
> >>> wrote:
> >>>
> >>>> Hey Chris,
> >>>>
> >>>> I think what you say makes sense. I agree that defining the behavior
> >>> based
> >>>> on code that can possibly change is not a good idea, and I was trying
> >> to
> >>>> get a clearer definition from the KIP's author :)
> >>>>
> >>>> I think it can always be hard to ensure that only specific errors are
> >>>> handled unless they are explicitly enumerated in code as the code can
> >>>> change and can be changed by folks who are not aware of this KIP or
> >>>> conversation.
> >>>> I personally don't have the bandwidth to do this
> definition/enumeration
> >>> of
> >>>> errors, so hopefully Alieh can expand upon this.
> >>>>
> >>>> Justine
> >>>>
> >>>> On Wed, Jul 3, 2024 at 8:28 AM Chris Egerton <chr...@aiven.io.invalid
> >
> >>>> wrote:
> >>>>
> >>>>> Hi Alieh,
> >>>>>
> >>>>> I don't love defining the changes for this KIP in terms of a catch
> >>> clause
> >>>>> in the KafkaProducer class, for two reasons. First, the set of errors
> >>>> that
> >>>>> are handled by that clause may shift over time as the code base is
> >>>>> modified, and second, it would be fairly opaque to users who want to
> >>>>> understand whether an error would be affected by using this API or
> >> not.
> >>>>>
> >>>>> It also seems strange that we'd handle some types of
> >>>>> RecordTooLargeException (i.e., ones reported client-side) with this
> >>> API,
> >>>>> but not others (i.e., ones reported by a broker).
> >>>>>
> >>>>> I think this kind of API would be most powerful, most intuitive to
> >>> users,
> >>>>> and easiest to document if we expanded the scope to all
> >>>> record-send-related
> >>>>> errors, except anything indicating issues with exactly-once
> >> semantics.
> >>>> That
> >>>>> would include records that are too large (when caught both client-
> >> and
> >>>>> server-side), records that can't be sent due to authorization
> >> failures,
> >>>>> records sent to nonexistent topics/topic partitions, and keyless
> >>> records
> >>>>> sent to compacted topics. It would not include
> >>>>> ProducerFencedException, InvalidProducerEpochException,
> >>>>> UnsupportedVersionException,
> >>>>> and possibly others.
> >>>>>
> >>>>> @Justine -- do you think it would be possible to develop either a
> >>> better
> >>>>> definition for the kinds of "excluded" errors that should not be
> >>> covered
> >>>> by
> >>>>> this API, or, barring that, a comprehensive list of exact error
> >> types?
> >>>> And
> >>>>> do you think this would be acceptable in terms of risk and
> >> complexity?
> >>>>>
> >>>>> Cheers,
> >>>>>
> >>>>> Chris
> >>>>>
> >>>>> On Tue, Jul 2, 2024 at 5:05 PM Alieh Saeedi
> >>> <asae...@confluent.io.invalid
> >>>>>
> >>>>> wrote:
> >>>>>
> >>>>>> Hey Justine,
> >>>>>>
> >>>>>> About the consequences: the consequences will be like when we did
> >> not
> >>>>> have
> >>>>>> the fix made in `KAFKA-9279`: silent loss of data! Obviously, when
> >>> the
> >>>>> user
> >>>>>> intentionally chose to ignore errors, that would not be silent any
> >>>> more.
> >>>>>> Right?
> >>>>>> Of course, considering all types of `ApiException`s would be too
> >>> broad.
> >>>>> But
> >>>>>> are the exceptions caught in `catch(ApiException e)` of the
> >>> `doSend()`
> >>>>>> method also too broad?
> >>>>>>
> >>>>>> -Alieh
> >>>>>>
> >>>>>> On Tue, Jul 2, 2024 at 9:45 PM Justine Olshan
> >>>>> <jols...@confluent.io.invalid
> >>>>>>>
> >>>>>> wrote:
> >>>>>>
> >>>>>>> Hey Alieh,
> >>>>>>>
> >>>>>>> If we want to allow any error to be ignored we should probably
> >> run
> >>>>>> through
> >>>>>>> all the errors to make sure they make sense.
> >>>>>>> I just want to feel confident that we aren't just making a
> >> decision
> >>>>>> without
> >>>>>>> considering the consequences carefully.
> >>>>>>>
> >>>>>>> Justine
> >>>>>>>
> >>>>>>> On Tue, Jul 2, 2024 at 12:30 PM Alieh Saeedi
> >>>>>> <asae...@confluent.io.invalid
> >>>>>>>>
> >>>>>>> wrote:
> >>>>>>>
> >>>>>>>> Hey Justine,
> >>>>>>>>
> >>>>>>>> yes we talked about `RecordTooLargeException` as an example,
> >> but
> >>>> did
> >>>>> we
> >>>>>>>> ever limit ourselves to only this specific exception? I think
> >>>> neither
> >>>>>> in
> >>>>>>>> the KIP nor in the PR.  As Chris mentioned, this KIP is going
> >> to
> >>>> undo
> >>>>>>> what
> >>>>>>>> we have done in `KAFKA-9279` in case 1) the user is in a
> >>>> transaction
> >>>>>> and
> >>>>>>> 2)
> >>>>>>>> he decides to ignore the errors in which the record was not
> >> even
> >>>>> added
> >>>>>> to
> >>>>>>>> the batch. Yes, and we suggested some methods for undoing or,
> >> in
> >>>>> fact,
> >>>>>>>> moving back the transaction from the error state in `flush` or
> >> in
> >>>>>>>> `commitTnx` and we finally came to the idea of not even doing
> >> the
> >>>>>> changes
> >>>>>>>> (better than undoing) in `send`.
> >>>>>>>>
> >>>>>>>> Bests,
> >>>>>>>> Alieh
> >>>>>>>>
> >>>>>>>> On Tue, Jul 2, 2024 at 8:03 PM Justine Olshan
> >>>>>>> <jols...@confluent.io.invalid
> >>>>>>>>>
> >>>>>>>> wrote:
> >>>>>>>>
> >>>>>>>>> Hey folks,
> >>>>>>>>>
> >>>>>>>>> I understand where you are coming from by asking for specific
> >>> use
> >>>>>>> cases.
> >>>>>>>> My
> >>>>>>>>> understanding based on previous conversations was that there
> >>>> were a
> >>>>>> few
> >>>>>>>>> different errors that have been seen.
> >>>>>>>>> One example I heard some information about was when the
> >> record
> >>>> was
> >>>>>> too
> >>>>>>>>> large and it fails the batch. Besides that, I'm not really
> >> sure
> >>>> if
> >>>>>>> there
> >>>>>>>>> are cases in mind, though it is fair to ask on those and
> >> bring
> >>>> them
> >>>>>> up.
> >>>>>>>>>
> >>>>>>>>>> Does a record qualify as a poison pill if it targets a
> >> topic
> >>>> that
> >>>>>>>>> doesn't exist? Or if it targets a topic that the producer
> >>>> principal
> >>>>>>> lacks
> >>>>>>>>> ACLs for? What if it fails broker-side validation (e.g., has
> >> a
> >>>> null
> >>>>>> key
> >>>>>>>> for
> >>>>>>>>> a compacted topic)?
> >>>>>>>>>
> >>>>>>>>> I think there was some parallel work with addressing the
> >>>>>>>>> UnknownTopicOrPartitionError in another way. As for the other
> >>>>> checks,
> >>>>>>>> acls,
> >>>>>>>>> validation etc. I am not aware of that being in Alieh's
> >> scope,
> >>>> but
> >>>>> we
> >>>>>>>>> should be clear about exactly what we are doing.
> >>>>>>>>>
> >>>>>>>>> All errors that fall into ApiException seems too broad to me.
> >>>>>>>>>
> >>>>>>>>> Justine
> >>>>>>>>>
> >>>>>>>>> On Tue, Jul 2, 2024 at 10:51 AM Alieh Saeedi
> >>>>>>>> <asae...@confluent.io.invalid
> >>>>>>>>>>
> >>>>>>>>> wrote:
> >>>>>>>>>
> >>>>>>>>>> Hey Chris,
> >>>>>>>>>> thanks for sharing your concerns.
> >>>>>>>>>>
> >>>>>>>>>> 1) About the language of KIP (or maybe later in Javadocs):
> >> Is
> >>>>> that
> >>>>>>>>> alright
> >>>>>>>>>> if I write all errors that fall into the `ApiException`
> >>>> category
> >>>>>>> thrown
> >>>>>>>>>> (actually returned) by Producer?
> >>>>>>>>>> 2) About future expansion: do you have any better
> >> suggestions
> >>>> for
> >>>>>>> enum
> >>>>>>>>>> names? Do you think `IGNORE_API_EXEPTIONS` or something
> >> like
> >>>> that
> >>>>>> is
> >>>>>>> a
> >>>>>>>>>> "better/more accurate" one?
> >>>>>>>>>>
> >>>>>>>>>> Bests,
> >>>>>>>>>> Alieh
> >>>>>>>>>>
> >>>>>>>>>> On Tue, Jul 2, 2024 at 7:29 PM Chris Egerton
> >>>>>> <chr...@aiven.io.invalid
> >>>>>>>>
> >>>>>>>>>> wrote:
> >>>>>>>>>>
> >>>>>>>>>>> Hi Alieh and Justine,
> >>>>>>>>>>>
> >>>>>>>>>>> I'm concerned that we're settling on a definition of
> >>> "poison
> >>>>>> pill"
> >>>>>>>>> that's
> >>>>>>>>>>> easiest to tackle right now but may lead to shortcomings
> >>> down
> >>>>> the
> >>>>>>>>> road. I
> >>>>>>>>>>> understand the relationship between this KIP and
> >>> KAFKA-9279,
> >>>>> and
> >>>>>> I
> >>>>>>>> can
> >>>>>>>>>>> totally get behind the desire to keep things small,
> >>> focused,
> >>>>> and
> >>>>>>>> simple
> >>>>>>>>>> in
> >>>>>>>>>>> the name of avoiding bugs. However, what I don't think is
> >>>> clear
> >>>>>> at
> >>>>>>>> all
> >>>>>>>>> is
> >>>>>>>>>>> what the "specific circumstances" are that Justine
> >>>> mentioned. I
> >>>>>>> had a
> >>>>>>>>>>> drastically different idea of what the intended
> >> behavioral
> >>>>> change
> >>>>>>>> would
> >>>>>>>>>> be
> >>>>>>>>>>> before looking at the draft PR.
> >>>>>>>>>>>
> >>>>>>>>>>> I would like 1) for us to be clearer about the categories
> >>> of
> >>>>>> errors
> >>>>>>>>> that
> >>>>>>>>>> we
> >>>>>>>>>>> want to cover with this new API (especially since we'll
> >>> have
> >>>> to
> >>>>>>> find
> >>>>>>>> a
> >>>>>>>>>>> clear, succinct way to document this for users), and 2)
> >> to
> >>>> make
> >>>>>>> sure
> >>>>>>>>> that
> >>>>>>>>>>> if we do try to expand this API in the future, that we
> >>> won't
> >>>> be
> >>>>>>>> painted
> >>>>>>>>>>> into a corner.
> >>>>>>>>>>>
> >>>>>>>>>>> For item 1, hopefully we can agree that the language in
> >> the
> >>>> KIP
> >>>>>>>>>>> for IGNORE_SEND_ERRORS ("The records causing
> >> irrecoverable
> >>>>> errors
> >>>>>>> are
> >>>>>>>>>>> excluded from the batch and the transaction is committed
> >>>>>>>>> successfully.")
> >>>>>>>>>> is
> >>>>>>>>>>> pretty vague. If we start using the phrase "poison pill
> >>>> record"
> >>>>>>> that
> >>>>>>>>>> could
> >>>>>>>>>>> help, but IMO more detail would still be needed. We know
> >>> that
> >>>>> we
> >>>>>>> want
> >>>>>>>>> to
> >>>>>>>>>>> include records that are so large that they can be
> >>>> immediately
> >>>>>>>> rejected
> >>>>>>>>>> by
> >>>>>>>>>>> the producer. But there are other cases that users might
> >>>> expect
> >>>>>> to
> >>>>>>> be
> >>>>>>>>>>> handled. Does a record qualify as a poison pill if it
> >>>> targets a
> >>>>>>> topic
> >>>>>>>>>> that
> >>>>>>>>>>> doesn't exist? Or if it targets a topic that the producer
> >>>>>> principal
> >>>>>>>>> lacks
> >>>>>>>>>>> ACLs for? What if it fails broker-side validation (e.g.,
> >>> has
> >>>> a
> >>>>>> null
> >>>>>>>> key
> >>>>>>>>>> for
> >>>>>>>>>>> a compacted topic)?
> >>>>>>>>>>>
> >>>>>>>>>>> For item 2, this really depends on how narrow the scope
> >> of
> >>>> what
> >>>>>>> we're
> >>>>>>>>>> doing
> >>>>>>>>>>> right now is. If we only handle a subset of the examples
> >> I
> >>>> laid
> >>>>>> out
> >>>>>>>>> above
> >>>>>>>>>>> that could possibly be considered poison pills with this
> >>> KIP,
> >>>>> do
> >>>>>> we
> >>>>>>>>> want
> >>>>>>>>>> to
> >>>>>>>>>>> lock ourselves in to never addressing more in the future,
> >>> or
> >>>>> can
> >>>>>> we
> >>>>>>>>>> choose
> >>>>>>>>>>> an API (probably just enum names would be the only
> >>> important
> >>>>>>> decision
> >>>>>>>>>> here)
> >>>>>>>>>>> that leaves room for more later?
> >>>>>>>>>>>
> >>>>>>>>>>> Best,
> >>>>>>>>>>>
> >>>>>>>>>>> Chris
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> On Tue, Jul 2, 2024 at 12:28 PM Justine Olshan
> >>>>>>>>>>> <jols...@confluent.io.invalid>
> >>>>>>>>>>> wrote:
> >>>>>>>>>>>
> >>>>>>>>>>>> Chris and Alieh,
> >>>>>>>>>>>>
> >>>>>>>>>>>> My understanding is that this KIP is really only trying
> >>> to
> >>>>>> solve
> >>>>>>> an
> >>>>>>>>>> issue
> >>>>>>>>>>>> of a "poison pill" record that fails send().
> >>>>>>>>>>>> We've talked a lot about having a generic framework for
> >>> all
> >>>>>>> errors,
> >>>>>>>>>> but I
> >>>>>>>>>>>> don't think that is what this KIP is trying to do.
> >>>>> Essentially
> >>>>>>> the
> >>>>>>>>>>> request
> >>>>>>>>>>>> is to undo the change from KAFKA-9279
> >>>>>>>>>>>> <https://issues.apache.org/jira/browse/KAFKA-9279> but
> >>>> under
> >>>>>>>>> specific
> >>>>>>>>>>>> circumstances that are controlled. I really am
> >> concerned
> >>>>> about
> >>>>>>>>> opening
> >>>>>>>>>>> new
> >>>>>>>>>>>> avenues for bugs with EOS and hesitate to handle any
> >>> other
> >>>>>> types
> >>>>>>> of
> >>>>>>>>>>> errors.
> >>>>>>>>>>>> I think if we all agree on the problem that we are
> >> trying
> >>>> to
> >>>>>>> solve,
> >>>>>>>>> it
> >>>>>>>>>> is
> >>>>>>>>>>>> easier to agree on solutions.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Justine
> >>>>>>>>>>>>
> >>>>>>>>>>>> On Mon, Jul 1, 2024 at 2:20 AM Alieh Saeedi
> >>>>>>>>>> <asae...@confluent.io.invalid
> >>>>>>>>>>>>
> >>>>>>>>>>>> wrote:
> >>>>>>>>>>>>
> >>>>>>>>>>>>> Hi Matthias,
> >>>>>>>>>>>>> Thanks for the valid points you mentioned. I updated
> >>> the
> >>>>> KIP
> >>>>>>> and
> >>>>>>>>> the
> >>>>>>>>>> PR
> >>>>>>>>>>>>> with:
> >>>>>>>>>>>>> 1) mentioning that the new overloaded `send` throws
> >>>>>>>>>>>> `IllegalStateException`
> >>>>>>>>>>>>> if the user tries to ignore `send()` errors outside
> >> of
> >>> a
> >>>>>>>>> transaction.
> >>>>>>>>>>>>> 2) the default implementation in `Producer` interface
> >>>>> throws
> >>>>>> an
> >>>>>>>>>>>>> `UnsupportedOperationException`
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Hi Chris,
> >>>>>>>>>>>>> Thanks for the feedback. I tried to clarify the
> >> points
> >>>> you
> >>>>>>>> listed:
> >>>>>>>>>>>>> -------> we've narrowed the scope from any error that
> >>>> might
> >>>>>>> take
> >>>>>>>>>> place
> >>>>>>>>>>>> with
> >>>>>>>>>>>>> producing a record to Kafka, to only the ones that
> >> are
> >>>>> thrown
> >>>>>>>>>> directly
> >>>>>>>>>>>> from
> >>>>>>>>>>>>> Producer::send;
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> From the very beginning and even since KIP-1038, the
> >>> main
> >>>>>>> purpose
> >>>>>>>>> was
> >>>>>>>>>>> to
> >>>>>>>>>>>>> have "more flexibility," or, in other words, "giving
> >>> the
> >>>>> user
> >>>>>>> the
> >>>>>>>>>>>>> authority" to handle some specific exceptions thrown
> >>> from
> >>>>> the
> >>>>>>>>>>> `Producer`.
> >>>>>>>>>>>>> Due to the specific cases we had in mind, KIP-1038
> >> was
> >>>>>>> discarded
> >>>>>>>>> and
> >>>>>>>>>> we
> >>>>>>>>>>>>> decided to not define a `CustomExceptionHandler` for
> >>>>>> `Producer`
> >>>>>>>> and
> >>>>>>>>>>>> instead
> >>>>>>>>>>>>> treat the `send` failures in a different way. The
> >> main
> >>>>> issue
> >>>>>> is
> >>>>>>>>> that
> >>>>>>>>>>>> `send`
> >>>>>>>>>>>>> makes a transition to error state, which is undoable.
> >>> In
> >>>>>> fact,
> >>>>>>>> one
> >>>>>>>>>>> single
> >>>>>>>>>>>>> poison pill record makes the whole batch fail. The
> >>> former
> >>>>>>>>> suggestions
> >>>>>>>>>>>> that
> >>>>>>>>>>>>> you agreed with have been all about un-doing this
> >>>>> transition
> >>>>>> in
> >>>>>>>>>> `flush`
> >>>>>>>>>>>> or
> >>>>>>>>>>>>> `commit`. The new suggestion is to un-do (or better,
> >>> NOT
> >>>>> do)
> >>>>>> in
> >>>>>>>>>> `send`
> >>>>>>>>>>>> due
> >>>>>>>>>>>>> to the reasons listed in the discussions above.
> >>>>>>>>>>>>> Moreover, I would say that having such a large scope
> >> as
> >>>> you
> >>>>>>>>> mentioned
> >>>>>>>>>>> is
> >>>>>>>>>>>>> impossible. In the best case, we may have control
> >> over
> >>>> the
> >>>>>>>>>> `Producer`.
> >>>>>>>>>>>> What
> >>>>>>>>>>>>> shall we do with the broker? The `any error that
> >> might
> >>>> take
> >>>>>>> place
> >>>>>>>>>> with
> >>>>>>>>>>>>> producing a record to Kafka` is too much, I think.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> -------> is this all we want to handle, and will it
> >>>> prevent
> >>>>>> us
> >>>>>>>> from
> >>>>>>>>>>>>> handling more in the future in an intuitive way?
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> I think yes. This is all we want. Other sorts of
> >> errors
> >>>>> such
> >>>>>> as
> >>>>>>>>>> having
> >>>>>>>>>>>>> problem with partition addition, producer fenced
> >>>> exception,
> >>>>>> etc
> >>>>>>>>> seem
> >>>>>>>>>> to
> >>>>>>>>>>>> be
> >>>>>>>>>>>>> more serious issues. The intention was to handle
> >>> problems
> >>>>>>> created
> >>>>>>>>> by
> >>>>>>>>>>>>> (maybe) a single poison pill record. BTW, I do not
> >> see
> >>>> any
> >>>>>>>>> obstacles
> >>>>>>>>>> to
> >>>>>>>>>>>>> future changes.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Bests,
> >>>>>>>>>>>>> Alieh
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> On Sat, Jun 29, 2024 at 3:03 AM Chris Egerton
> >>>>>>>>>> <chr...@aiven.io.invalid
> >>>>>>>>>>>>
> >>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>> Ah, sorry--spoke too soon. The PR doesn't show that
> >>>>> errors
> >>>>>>>> thrown
> >>>>>>>>>>> from
> >>>>>>>>>>>>>> Producer::send are handled, but instead,
> >> ApiException
> >>>>>>> instances
> >>>>>>>>>> that
> >>>>>>>>>>>> are
> >>>>>>>>>>>>>> caught inside KafkaProducer::doSend and are handled
> >>> by
> >>>>>>>> returning
> >>>>>>>>> an
> >>>>>>>>>>>>>> already-failed future are. I think the same
> >> question
> >>>>> still
> >>>>>>>>> applies
> >>>>>>>>>>> (is
> >>>>>>>>>>>>> this
> >>>>>>>>>>>>>> all we want to handle, and will it prevent us from
> >>>>> handling
> >>>>>>>> more
> >>>>>>>>> in
> >>>>>>>>>>> the
> >>>>>>>>>>>>>> future in an intuitive way), though.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> On Fri, Jun 28, 2024 at 8:57 PM Chris Egerton <
> >>>>>>> chr...@aiven.io
> >>>>>>>>>
> >>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Hi Alieh,
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> This KIP has evolved a lot since I last looked at
> >>> it,
> >>>>> but
> >>>>>>> the
> >>>>>>>>>>> changes
> >>>>>>>>>>>>>> seem
> >>>>>>>>>>>>>>> well thought-out both in semantics and API. One
> >>>>>> clarifying
> >>>>>>>>>>> question I
> >>>>>>>>>>>>>> have
> >>>>>>>>>>>>>>> is that it looks based on the draft PR that we've
> >>>>>> narrowed
> >>>>>>>> the
> >>>>>>>>>>> scope
> >>>>>>>>>>>>> from
> >>>>>>>>>>>>>>> any error that might take place with producing a
> >>>> record
> >>>>>> to
> >>>>>>>>> Kafka,
> >>>>>>>>>>> to
> >>>>>>>>>>>>> only
> >>>>>>>>>>>>>>> the ones that are thrown directly from
> >>>> Producer::send;
> >>>>> is
> >>>>>>>> that
> >>>>>>>>>> the
> >>>>>>>>>>>>>> intended
> >>>>>>>>>>>>>>> behavior here? And if so, do you have thoughts on
> >>> how
> >>>>> we
> >>>>>>>> might
> >>>>>>>>>>>> design a
> >>>>>>>>>>>>>>> follow-up KIP that would catch all errors
> >>> (including
> >>>>> ones
> >>>>>>>>>> reported
> >>>>>>>>>>>>>>> asynchronously instead of synchronously)? I'd
> >> like
> >>> it
> >>>>> if
> >>>>>> we
> >>>>>>>>> could
> >>>>>>>>>>>> leave
> >>>>>>>>>>>>>> the
> >>>>>>>>>>>>>>> door open for that without painting ourselves
> >> into
> >>>> too
> >>>>>> much
> >>>>>>>> of
> >>>>>>>>> a
> >>>>>>>>>>>> corner
> >>>>>>>>>>>>>>> with the API design for this KIP.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Cheers,
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Chris
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> On Fri, Jun 28, 2024 at 6:31 PM Matthias J. Sax <
> >>>>>>>>>> mj...@apache.org>
> >>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Thanks Alieh,
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> it seems this KIP can just pick between a couple
> >>> of
> >>>>>>>> tradeoffs.
> >>>>>>>>>>>> Adding
> >>>>>>>>>>>>> an
> >>>>>>>>>>>>>>>> overloaded `send()` as the KIP propose makes
> >> sense
> >>>> to
> >>>>> me
> >>>>>>> and
> >>>>>>>>>> seems
> >>>>>>>>>>>> to
> >>>>>>>>>>>>>>>> provides the cleanest solution compare to there
> >>>>> options
> >>>>>> we
> >>>>>>>>>>>> discussed.
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Given the explicit name of the passed-in option
> >>> that
> >>>>>>>>> highlights
> >>>>>>>>>>> that
> >>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>> option is for TX only make is pretty clear and
> >>>> avoids
> >>>>>> the
> >>>>>>>>> issue
> >>>>>>>>>> of
> >>>>>>>>>>>>>>>> `flush()` ambiguity.
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Nit: We should make clear on the KIP though,
> >> that
> >>>> the
> >>>>>> new
> >>>>>>>>>> `send()`
> >>>>>>>>>>>>>>>> overload would throw an `IllegalStateException`
> >> if
> >>>> TX
> >>>>>> are
> >>>>>>>> not
> >>>>>>>>>> used
> >>>>>>>>>>>>>>>> (similar to other TX methods like initTx(), etc)
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> About the `Producer` interface, I am not sure
> >> how
> >>>> this
> >>>>>> was
> >>>>>>>>> done
> >>>>>>>>>> in
> >>>>>>>>>>>> the
> >>>>>>>>>>>>>>>> past (eg, KIP-266 added
> >> `Consumer.poll(Duration)`
> >>>> w/o
> >>>>> a
> >>>>>>>>> default
> >>>>>>>>>>>>>>>> implementation), if we need a default
> >>> implementation
> >>>>> for
> >>>>>>>>>> backward
> >>>>>>>>>>>>>>>> compatibility or not? If we do want to add one,
> >> I
> >>>>> think
> >>>>>> it
> >>>>>>>>> would
> >>>>>>>>>>> be
> >>>>>>>>>>>>>>>> appropriate to throw an
> >>>>> `UnsupportedOperationException`
> >>>>>> by
> >>>>>>>>>>> default,
> >>>>>>>>>>>>>>>> instead of just keeping the default impl empty?
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> My points are rather minor, and should not block
> >>>> this
> >>>>>> KIP
> >>>>>>>>>> though.
> >>>>>>>>>>>>>>>> Overall LGTM.
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> -Matthias
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> On 6/27/24 1:28 PM, Alieh Saeedi wrote:
> >>>>>>>>>>>>>>>>> Hi Justine,
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> Thanks for the suggestion.
> >>>>>>>>>>>>>>>>> Making applications to validate every single
> >>>> record
> >>>>> is
> >>>>>>> not
> >>>>>>>>> the
> >>>>>>>>>>>> best
> >>>>>>>>>>>>>> way,
> >>>>>>>>>>>>>>>>> from an efficiency point of view.
> >>>>>>>>>>>>>>>>> Moreover, between changing the behavior of the
> >>>>>> Producer
> >>>>>>> in
> >>>>>>>>>>> `send`
> >>>>>>>>>>>>> and
> >>>>>>>>>>>>>>>>> `commitTnx`, the former seems more reasonable
> >>> and
> >>>>>> clean.
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> Bests,
> >>>>>>>>>>>>>>>>> Alieh
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> On Thu, Jun 27, 2024 at 8:14 PM Justine Olshan
> >>>>>>>>>>>>>>>> <jols...@confluent.io.invalid>
> >>>>>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> Hey Alieh,
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> I see there are two options now. So folks
> >> will
> >>> be
> >>>>>>>>> discussing
> >>>>>>>>>>> the
> >>>>>>>>>>>>>>>> approaches
> >>>>>>>>>>>>>>>>>> and deciding the best way forward before we
> >>> vote?
> >>>>>>>>>>>>>>>>>> I do think there could be a problem with the
> >>>>> approach
> >>>>>>> on
> >>>>>>>>>> commit
> >>>>>>>>>>>> if
> >>>>>>>>>>>>> we
> >>>>>>>>>>>>>>>> get
> >>>>>>>>>>>>>>>>>> stuck on an earlier error and have more
> >> records
> >>>>>>>>> (potentially
> >>>>>>>>>> on
> >>>>>>>>>>>> new
> >>>>>>>>>>>>>>>>>> partitions) to commit as the current PR is
> >>>>>> implemented.
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> I guess this takes us back to the question of
> >>>>> whether
> >>>>>>> the
> >>>>>>>>>> error
> >>>>>>>>>>>>>> should
> >>>>>>>>>>>>>>>> be
> >>>>>>>>>>>>>>>>>> cleared on send.
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> (And I guess at the back of my mind, I'm
> >>>> wondering
> >>>>> if
> >>>>>>>> there
> >>>>>>>>>> is
> >>>>>>>>>>> a
> >>>>>>>>>>>>> way
> >>>>>>>>>>>>>>>> we can
> >>>>>>>>>>>>>>>>>> validate the "posion pill" records
> >> application
> >>>> side
> >>>>>>>> before
> >>>>>>>>> we
> >>>>>>>>>>>> even
> >>>>>>>>>>>>>> try
> >>>>>>>>>>>>>>>> to
> >>>>>>>>>>>>>>>>>> send them)
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> Justine
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> On Wed, Jun 26, 2024 at 4:38 PM Alieh Saeedi
> >>>>>>>>>>>>>>>> <asae...@confluent.io.invalid
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> Hi Justine,
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> I did not update the KIP with
> >> `TxnSendOption`
> >>>>> since
> >>>>>> I
> >>>>>>>>>> thought
> >>>>>>>>>>>> it'd
> >>>>>>>>>>>>>> be
> >>>>>>>>>>>>>>>>>>> better discussed here beforehand.
> >>>>>>>>>>>>>>>>>>> right now, there are 2 PRs:
> >>>>>>>>>>>>>>>>>>> - the PR that implements the current version
> >>> of
> >>>>> the
> >>>>>>> KIP:
> >>>>>>>>>>>>>>>>>>> https://github.com/apache/kafka/pull/16332
> >>>>>>>>>>>>>>>>>>> - the POC PR that clarifies the
> >>> `TxnSendOption`:
> >>>>>>>>>>>>>>>>>>> https://github.com/apache/kafka/pull/16465
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> Bests,
> >>>>>>>>>>>>>>>>>>> Alieh
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> On Thu, Jun 27, 2024 at 12:42 AM Justine
> >>> Olshan
> >>>>>>>>>>>>>>>>>>> <jols...@confluent.io.invalid> wrote:
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> Hey Alieh,
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> I think I am a little confused. Are the 3
> >>>> points
> >>>>>>> above
> >>>>>>>>>>>> addressed
> >>>>>>>>>>>>> by
> >>>>>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>>>> KIP
> >>>>>>>>>>>>>>>>>>>> or did something change? The PR seems to
> >> not
> >>>>>> include
> >>>>>>>> this
> >>>>>>>>>>>> change
> >>>>>>>>>>>>>> and
> >>>>>>>>>>>>>>>>>>> still
> >>>>>>>>>>>>>>>>>>>> has the CommitOption as well.
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> Thanks,
> >>>>>>>>>>>>>>>>>>>> Justine
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> On Wed, Jun 26, 2024 at 2:15 PM Alieh
> >> Saeedi
> >>>>>>>>>>>>>>>>>>> <asae...@confluent.io.invalid
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>> Hi all,
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>> Looking at the PR <
> >>>>>>>>>>> https://github.com/apache/kafka/pull/16332
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>> corresponding to the KIP, there are some
> >>>> points
> >>>>>>> worthy
> >>>>>>>>> of
> >>>>>>>>>>>>> mention:
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>> 1) clearing the error ends up dirty/messy
> >>> code
> >>>>> in
> >>>>>>>>>>>>>>>>>> `TransactionManager`.
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>> 2) By clearing the error, we are actually
> >>>> doing
> >>>>> an
> >>>>>>>>> illegal
> >>>>>>>>>>>>>>>> transition
> >>>>>>>>>>>>>>>>>>>> from
> >>>>>>>>>>>>>>>>>>>>> `ABORTABLE_ERROR` to `IN_TRANSACTION`
> >> which
> >>> is
> >>>>>>>>>> conceptually
> >>>>>>>>>>>> not
> >>>>>>>>>>>>>>>>>>>> acceptable.
> >>>>>>>>>>>>>>>>>>>>> This can be the root cause of some issues,
> >>>> with
> >>>>>>>> perhaps
> >>>>>>>>>>>> further
> >>>>>>>>>>>>>>>>>> future
> >>>>>>>>>>>>>>>>>>>>> changes by others.
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>> 3) If the poison pill record `r1` causes a
> >>>>>>> transition
> >>>>>>>> to
> >>>>>>>>>> the
> >>>>>>>>>>>>> error
> >>>>>>>>>>>>>>>>>>> state
> >>>>>>>>>>>>>>>>>>>>> and then the next record `r2` requires
> >>> adding
> >>>> a
> >>>>>>>>> partition
> >>>>>>>>>> to
> >>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>>>>>> transaction, the action fails due to being
> >>> in
> >>>>> the
> >>>>>>>> error
> >>>>>>>>>>> state.
> >>>>>>>>>>>>> In
> >>>>>>>>>>>>>>>>>> this
> >>>>>>>>>>>>>>>>>>>>> case, clearing errors during
> >>>>>>>>> `commitTnx(CLEAR_SEND_ERROR)`
> >>>>>>>>>>> is
> >>>>>>>>>>>>> too
> >>>>>>>>>>>>>>>>>> late.
> >>>>>>>>>>>>>>>>>>>>> However, this case can NOT be the main
> >>> concern
> >>>>> as
> >>>>>>> soon
> >>>>>>>>> as
> >>>>>>>>>>>>> KIP-890
> >>>>>>>>>>>>>> is
> >>>>>>>>>>>>>>>>>>>> fully
> >>>>>>>>>>>>>>>>>>>>> implemented.
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>> My suggestion is to solve the problem
> >> where
> >>> it
> >>>>>>> arises.
> >>>>>>>>> If
> >>>>>>>>>>> the
> >>>>>>>>>>>>>>>>>>> transition
> >>>>>>>>>>>>>>>>>>>> to
> >>>>>>>>>>>>>>>>>>>>> the error state does not happen during
> >>>> `send()`,
> >>>>>> we
> >>>>>>> do
> >>>>>>>>> not
> >>>>>>>>>>>> need
> >>>>>>>>>>>>> to
> >>>>>>>>>>>>>>>>>>> clear
> >>>>>>>>>>>>>>>>>>>>> the error later. Therefore, instead of
> >>>>>>> `CommitOption`,
> >>>>>>>>> we
> >>>>>>>>>>> can
> >>>>>>>>>>>>>> define
> >>>>>>>>>>>>>>>>>> a
> >>>>>>>>>>>>>>>>>>>>> `TxnSendOption` and prevent the `send()`
> >>>> method
> >>>>>> from
> >>>>>>>>> going
> >>>>>>>>>>> to
> >>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>>> error
> >>>>>>>>>>>>>>>>>>>>> state in case 1) we're in a transaction
> >> and
> >>> 2)
> >>>>> the
> >>>>>>>> user
> >>>>>>>>>>> asked
> >>>>>>>>>>>>> for
> >>>>>>>>>>>>>>>>>>>>> IGONRE_SEND_ERRORS. For more clarity, you
> >>> can
> >>>>>> take a
> >>>>>>>>> look
> >>>>>>>>>> at
> >>>>>>>>>>>> the
> >>>>>>>>>>>>>> POC
> >>>>>>>>>>>>>>>>>> PR
> >>>>>>>>>>>>>>>>>>>>> <
> >> https://github.com/apache/kafka/pull/16465
> >>>> .
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>> Cheers,
> >>>>>>>>>>>>>>>>>>>>> Alieh
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>
> >>>>>>
> >>>>>
> >>>>
> >>>
> >>
>
>

Reply via email to