Can we update the KIP to clearly document these decisions? Thanks,
Justine On Tue, Jul 9, 2024 at 9:25 AM Andrew Schofield <andrew_schofi...@live.com> wrote: > Hi Chris, > As it stands, the error handling for transactions in KafkaProducer is not > ideal. There’s no reason why a failed operation should fail a transaction > provided that the application can tell that the operation was not included > in the transaction and then make its own decision whether to continue or > back out. So, I think I disagree with the original premise of a client-side > error state for a transaction, but we are where we are. > > When I voted, I did not expect the KIP to handle ALL errors which could > conceivably be handled. I did expect it to handle client-side send errors > that would cause a record to be rejected from a batch before sending to a > broker. I think that it does make the KafkaProducer interface very slightly > more complicated, but the new option is a clear improvement and I > don’t see anyone getting into a mess using it. > > I think broker-side errors are more tricky and I don’t think an overload > on the send() method is going to do the job. I don’t see that as a problem > with the KIP, just that the underlying RPCs and behaviour is not very > amenable to record-specific error handling. The Produce RPC is a > complicated beast which can include a set of records for mutiple > topic-partitions. Although ProduceResponse v10 does include record > errors, I don’t believe this is surfaced in the client. Let’s imagine > something > like broker-side record validation which barfs on one record. Failing an > entire batch is easier, but less useful if the problem is related to one > record. > > In summary, I’m happy that my vote stands, and I am happy with the KIP > only supporting client-side record errors. > > Thanks, > Andrew > > > On 8 Jul 2024, at 16:37, Chris Egerton <chr...@aiven.io.INVALID> wrote: > > > > Hi Alieh, > > > > Can you clarify why broker-side errors shouldn't be covered? The only > real > > rationale I can come up with is that it's easier to implement. > > > > "Things were better for Kafka Streams before KAFKA-9279 was fixed" isn't > > very convincing, because Kafka Streams is not the only user of the Java > > producer client. And for others, especially new users, I doubt that this > > new API we're proposing would make sense without having to consult a lot > of > > historical context. > > > > I also don't think that most users will know or even care about the > > distinction between errors that cause a record to fail before it's added > to > > a batch vs. after. If you were writing a producer application of your > own, > > and you wanted to handle RecordTooLargeException instances by dropping a > > record without aborting a transaction, would you care about whether it > was > > your client or your broker that balked? Would you be happy if you wrote > > logic expecting that that problem was solved once and for all, only to > > learn that it could still affect you in other circumstances? Or, > > alternatively, would you be happy if you wanted to solve that problem and > > found an API that seemed to do exactly what you wanted, but after reading > > the fine print, realized you'd have to do it yourself instead? > > > > Ultimately, the more I think about this, the more I believe that we're > > adding noise to the API (with the new overloaded variant of send) for a > > feature that will likely bring confusion and even frustration to anyone > > besides maintainers of Kafka Streams who tries to use it. > > > > If the only concern about covering broker-side errors is that it would be > > more difficult to implement, I believe we should strongly reconsider that > > alternative. That said, if there is a straightforward way to explain this > > feature to new users that won't mislead them or require them to do > research > > on producer internals, then I can still live with it. > > > > Regarding a list of recoverable vs. irrecoverable errors, this is > actually > > the subject of another recently-introduced KIP: > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-1050%3A+Consistent+error+handling+for+Transactions > > > > Finally, I'd also like to ask the people who have already voted (Andrew, > > Matthias) if, at the time they voted, they believed that the API would > > handle all errors, or only the subset of errors that would cause a record > > to be rejected from a batch before it can be sent to a broker. > > > > Best, > > > > Chris > > > > On Thu, Jul 4, 2024 at 12:43 PM Alieh Saeedi > <asae...@confluent.io.invalid> > > wrote: > > > >> Salut from the KIP’s author > >> > >> > >> Clarifying two points: > >> > >> > >> 1) broker side errors: > >> > >> As far as I remember we are not going to cover the errors originating > from > >> the broker! > >> > >> A historical fact: One of the debate points in KIP-1038 was that by > >> defining a producer custom handler, the user may assume that broker-side > >> errors must be covered as well. They may define a handler for handling > >> `RecordTooLargeException` and still see such errors not being handled as > >> they wish. > >> > >> > >> 2) Regarding irrecoverable/recoverable errors: > >> > >> Before the fix of `KAFKA-9279`, errors such as > `RecordTooLargeException` > >> or errors related to missing meta data (both originating from Producer > >> `send()`) were considered as recoverable but after that they turned into > >> being irrecoverable without changing any Javadocs or having any KIP. > All > >> the effort made in this KIP and the former one have been towards > returning > >> to the former state. > >> > >> > >> I am sure that it is clear for you that which sort of errors we are > going > >> to cover: A single record may happen to NOT get added to the batch due > to > >> the issues with the record or its corresponding topic. The point was > that > >> if the record is not added to the batch let ’s don’t fail the whole > batch > >> because of that non-existing record. We never intended to do sth in > broker > >> side or ignore more important errors. But I agree with you Chris. If we > >> are adding a new API, we must have good documentation for that. The > >> sentence `all irrecoverable transactional errors will still be fatal` as > >> you suggested is good. What do you think? I am totally against > enumerating > >> errors in Javadocs since these sort of errors can be changing during > >> time. More > >> over, have you ever seen any list of recoverable or irrecoverable errors > >> somewhere so far? > >> > >> > >> Bests, > >> > >> Alieh > >> > >> On Wed, Jul 3, 2024 at 6:07 PM Chris Egerton <chr...@aiven.io.invalid> > >> wrote: > >> > >>> Hi Justine, > >>> > >>> I agree that enumerating a list of errors that should be covered by the > >> KIP > >>> is difficult; I was thinking it might be easier if we list the errors > >> that > >>> should _not_ be covered by the KIP, and only if we can't define a > >>> reasonable heuristic that would cover them without having to explicitly > >>> list them. Could it be enough to say "all irrecoverable transactional > >>> errors will still be fatal", or even just "all transactional errors (as > >>> opposed to errors related to this specific record) will still be > fatal"? > >>> > >>> Cheers, > >>> > >>> Chris > >>> > >>> On Wed, Jul 3, 2024 at 11:56 AM Justine Olshan > >>> <jols...@confluent.io.invalid> > >>> wrote: > >>> > >>>> Hey Chris, > >>>> > >>>> I think what you say makes sense. I agree that defining the behavior > >>> based > >>>> on code that can possibly change is not a good idea, and I was trying > >> to > >>>> get a clearer definition from the KIP's author :) > >>>> > >>>> I think it can always be hard to ensure that only specific errors are > >>>> handled unless they are explicitly enumerated in code as the code can > >>>> change and can be changed by folks who are not aware of this KIP or > >>>> conversation. > >>>> I personally don't have the bandwidth to do this > definition/enumeration > >>> of > >>>> errors, so hopefully Alieh can expand upon this. > >>>> > >>>> Justine > >>>> > >>>> On Wed, Jul 3, 2024 at 8:28 AM Chris Egerton <chr...@aiven.io.invalid > > > >>>> wrote: > >>>> > >>>>> Hi Alieh, > >>>>> > >>>>> I don't love defining the changes for this KIP in terms of a catch > >>> clause > >>>>> in the KafkaProducer class, for two reasons. First, the set of errors > >>>> that > >>>>> are handled by that clause may shift over time as the code base is > >>>>> modified, and second, it would be fairly opaque to users who want to > >>>>> understand whether an error would be affected by using this API or > >> not. > >>>>> > >>>>> It also seems strange that we'd handle some types of > >>>>> RecordTooLargeException (i.e., ones reported client-side) with this > >>> API, > >>>>> but not others (i.e., ones reported by a broker). > >>>>> > >>>>> I think this kind of API would be most powerful, most intuitive to > >>> users, > >>>>> and easiest to document if we expanded the scope to all > >>>> record-send-related > >>>>> errors, except anything indicating issues with exactly-once > >> semantics. > >>>> That > >>>>> would include records that are too large (when caught both client- > >> and > >>>>> server-side), records that can't be sent due to authorization > >> failures, > >>>>> records sent to nonexistent topics/topic partitions, and keyless > >>> records > >>>>> sent to compacted topics. It would not include > >>>>> ProducerFencedException, InvalidProducerEpochException, > >>>>> UnsupportedVersionException, > >>>>> and possibly others. > >>>>> > >>>>> @Justine -- do you think it would be possible to develop either a > >>> better > >>>>> definition for the kinds of "excluded" errors that should not be > >>> covered > >>>> by > >>>>> this API, or, barring that, a comprehensive list of exact error > >> types? > >>>> And > >>>>> do you think this would be acceptable in terms of risk and > >> complexity? > >>>>> > >>>>> Cheers, > >>>>> > >>>>> Chris > >>>>> > >>>>> On Tue, Jul 2, 2024 at 5:05 PM Alieh Saeedi > >>> <asae...@confluent.io.invalid > >>>>> > >>>>> wrote: > >>>>> > >>>>>> Hey Justine, > >>>>>> > >>>>>> About the consequences: the consequences will be like when we did > >> not > >>>>> have > >>>>>> the fix made in `KAFKA-9279`: silent loss of data! Obviously, when > >>> the > >>>>> user > >>>>>> intentionally chose to ignore errors, that would not be silent any > >>>> more. > >>>>>> Right? > >>>>>> Of course, considering all types of `ApiException`s would be too > >>> broad. > >>>>> But > >>>>>> are the exceptions caught in `catch(ApiException e)` of the > >>> `doSend()` > >>>>>> method also too broad? > >>>>>> > >>>>>> -Alieh > >>>>>> > >>>>>> On Tue, Jul 2, 2024 at 9:45 PM Justine Olshan > >>>>> <jols...@confluent.io.invalid > >>>>>>> > >>>>>> wrote: > >>>>>> > >>>>>>> Hey Alieh, > >>>>>>> > >>>>>>> If we want to allow any error to be ignored we should probably > >> run > >>>>>> through > >>>>>>> all the errors to make sure they make sense. > >>>>>>> I just want to feel confident that we aren't just making a > >> decision > >>>>>> without > >>>>>>> considering the consequences carefully. > >>>>>>> > >>>>>>> Justine > >>>>>>> > >>>>>>> On Tue, Jul 2, 2024 at 12:30 PM Alieh Saeedi > >>>>>> <asae...@confluent.io.invalid > >>>>>>>> > >>>>>>> wrote: > >>>>>>> > >>>>>>>> Hey Justine, > >>>>>>>> > >>>>>>>> yes we talked about `RecordTooLargeException` as an example, > >> but > >>>> did > >>>>> we > >>>>>>>> ever limit ourselves to only this specific exception? I think > >>>> neither > >>>>>> in > >>>>>>>> the KIP nor in the PR. As Chris mentioned, this KIP is going > >> to > >>>> undo > >>>>>>> what > >>>>>>>> we have done in `KAFKA-9279` in case 1) the user is in a > >>>> transaction > >>>>>> and > >>>>>>> 2) > >>>>>>>> he decides to ignore the errors in which the record was not > >> even > >>>>> added > >>>>>> to > >>>>>>>> the batch. Yes, and we suggested some methods for undoing or, > >> in > >>>>> fact, > >>>>>>>> moving back the transaction from the error state in `flush` or > >> in > >>>>>>>> `commitTnx` and we finally came to the idea of not even doing > >> the > >>>>>> changes > >>>>>>>> (better than undoing) in `send`. > >>>>>>>> > >>>>>>>> Bests, > >>>>>>>> Alieh > >>>>>>>> > >>>>>>>> On Tue, Jul 2, 2024 at 8:03 PM Justine Olshan > >>>>>>> <jols...@confluent.io.invalid > >>>>>>>>> > >>>>>>>> wrote: > >>>>>>>> > >>>>>>>>> Hey folks, > >>>>>>>>> > >>>>>>>>> I understand where you are coming from by asking for specific > >>> use > >>>>>>> cases. > >>>>>>>> My > >>>>>>>>> understanding based on previous conversations was that there > >>>> were a > >>>>>> few > >>>>>>>>> different errors that have been seen. > >>>>>>>>> One example I heard some information about was when the > >> record > >>>> was > >>>>>> too > >>>>>>>>> large and it fails the batch. Besides that, I'm not really > >> sure > >>>> if > >>>>>>> there > >>>>>>>>> are cases in mind, though it is fair to ask on those and > >> bring > >>>> them > >>>>>> up. > >>>>>>>>> > >>>>>>>>>> Does a record qualify as a poison pill if it targets a > >> topic > >>>> that > >>>>>>>>> doesn't exist? Or if it targets a topic that the producer > >>>> principal > >>>>>>> lacks > >>>>>>>>> ACLs for? What if it fails broker-side validation (e.g., has > >> a > >>>> null > >>>>>> key > >>>>>>>> for > >>>>>>>>> a compacted topic)? > >>>>>>>>> > >>>>>>>>> I think there was some parallel work with addressing the > >>>>>>>>> UnknownTopicOrPartitionError in another way. As for the other > >>>>> checks, > >>>>>>>> acls, > >>>>>>>>> validation etc. I am not aware of that being in Alieh's > >> scope, > >>>> but > >>>>> we > >>>>>>>>> should be clear about exactly what we are doing. > >>>>>>>>> > >>>>>>>>> All errors that fall into ApiException seems too broad to me. > >>>>>>>>> > >>>>>>>>> Justine > >>>>>>>>> > >>>>>>>>> On Tue, Jul 2, 2024 at 10:51 AM Alieh Saeedi > >>>>>>>> <asae...@confluent.io.invalid > >>>>>>>>>> > >>>>>>>>> wrote: > >>>>>>>>> > >>>>>>>>>> Hey Chris, > >>>>>>>>>> thanks for sharing your concerns. > >>>>>>>>>> > >>>>>>>>>> 1) About the language of KIP (or maybe later in Javadocs): > >> Is > >>>>> that > >>>>>>>>> alright > >>>>>>>>>> if I write all errors that fall into the `ApiException` > >>>> category > >>>>>>> thrown > >>>>>>>>>> (actually returned) by Producer? > >>>>>>>>>> 2) About future expansion: do you have any better > >> suggestions > >>>> for > >>>>>>> enum > >>>>>>>>>> names? Do you think `IGNORE_API_EXEPTIONS` or something > >> like > >>>> that > >>>>>> is > >>>>>>> a > >>>>>>>>>> "better/more accurate" one? > >>>>>>>>>> > >>>>>>>>>> Bests, > >>>>>>>>>> Alieh > >>>>>>>>>> > >>>>>>>>>> On Tue, Jul 2, 2024 at 7:29 PM Chris Egerton > >>>>>> <chr...@aiven.io.invalid > >>>>>>>> > >>>>>>>>>> wrote: > >>>>>>>>>> > >>>>>>>>>>> Hi Alieh and Justine, > >>>>>>>>>>> > >>>>>>>>>>> I'm concerned that we're settling on a definition of > >>> "poison > >>>>>> pill" > >>>>>>>>> that's > >>>>>>>>>>> easiest to tackle right now but may lead to shortcomings > >>> down > >>>>> the > >>>>>>>>> road. I > >>>>>>>>>>> understand the relationship between this KIP and > >>> KAFKA-9279, > >>>>> and > >>>>>> I > >>>>>>>> can > >>>>>>>>>>> totally get behind the desire to keep things small, > >>> focused, > >>>>> and > >>>>>>>> simple > >>>>>>>>>> in > >>>>>>>>>>> the name of avoiding bugs. However, what I don't think is > >>>> clear > >>>>>> at > >>>>>>>> all > >>>>>>>>> is > >>>>>>>>>>> what the "specific circumstances" are that Justine > >>>> mentioned. I > >>>>>>> had a > >>>>>>>>>>> drastically different idea of what the intended > >> behavioral > >>>>> change > >>>>>>>> would > >>>>>>>>>> be > >>>>>>>>>>> before looking at the draft PR. > >>>>>>>>>>> > >>>>>>>>>>> I would like 1) for us to be clearer about the categories > >>> of > >>>>>> errors > >>>>>>>>> that > >>>>>>>>>> we > >>>>>>>>>>> want to cover with this new API (especially since we'll > >>> have > >>>> to > >>>>>>> find > >>>>>>>> a > >>>>>>>>>>> clear, succinct way to document this for users), and 2) > >> to > >>>> make > >>>>>>> sure > >>>>>>>>> that > >>>>>>>>>>> if we do try to expand this API in the future, that we > >>> won't > >>>> be > >>>>>>>> painted > >>>>>>>>>>> into a corner. > >>>>>>>>>>> > >>>>>>>>>>> For item 1, hopefully we can agree that the language in > >> the > >>>> KIP > >>>>>>>>>>> for IGNORE_SEND_ERRORS ("The records causing > >> irrecoverable > >>>>> errors > >>>>>>> are > >>>>>>>>>>> excluded from the batch and the transaction is committed > >>>>>>>>> successfully.") > >>>>>>>>>> is > >>>>>>>>>>> pretty vague. If we start using the phrase "poison pill > >>>> record" > >>>>>>> that > >>>>>>>>>> could > >>>>>>>>>>> help, but IMO more detail would still be needed. We know > >>> that > >>>>> we > >>>>>>> want > >>>>>>>>> to > >>>>>>>>>>> include records that are so large that they can be > >>>> immediately > >>>>>>>> rejected > >>>>>>>>>> by > >>>>>>>>>>> the producer. But there are other cases that users might > >>>> expect > >>>>>> to > >>>>>>> be > >>>>>>>>>>> handled. Does a record qualify as a poison pill if it > >>>> targets a > >>>>>>> topic > >>>>>>>>>> that > >>>>>>>>>>> doesn't exist? Or if it targets a topic that the producer > >>>>>> principal > >>>>>>>>> lacks > >>>>>>>>>>> ACLs for? What if it fails broker-side validation (e.g., > >>> has > >>>> a > >>>>>> null > >>>>>>>> key > >>>>>>>>>> for > >>>>>>>>>>> a compacted topic)? > >>>>>>>>>>> > >>>>>>>>>>> For item 2, this really depends on how narrow the scope > >> of > >>>> what > >>>>>>> we're > >>>>>>>>>> doing > >>>>>>>>>>> right now is. If we only handle a subset of the examples > >> I > >>>> laid > >>>>>> out > >>>>>>>>> above > >>>>>>>>>>> that could possibly be considered poison pills with this > >>> KIP, > >>>>> do > >>>>>> we > >>>>>>>>> want > >>>>>>>>>> to > >>>>>>>>>>> lock ourselves in to never addressing more in the future, > >>> or > >>>>> can > >>>>>> we > >>>>>>>>>> choose > >>>>>>>>>>> an API (probably just enum names would be the only > >>> important > >>>>>>> decision > >>>>>>>>>> here) > >>>>>>>>>>> that leaves room for more later? > >>>>>>>>>>> > >>>>>>>>>>> Best, > >>>>>>>>>>> > >>>>>>>>>>> Chris > >>>>>>>>>>> > >>>>>>>>>>> > >>>>>>>>>>> > >>>>>>>>>>> On Tue, Jul 2, 2024 at 12:28 PM Justine Olshan > >>>>>>>>>>> <jols...@confluent.io.invalid> > >>>>>>>>>>> wrote: > >>>>>>>>>>> > >>>>>>>>>>>> Chris and Alieh, > >>>>>>>>>>>> > >>>>>>>>>>>> My understanding is that this KIP is really only trying > >>> to > >>>>>> solve > >>>>>>> an > >>>>>>>>>> issue > >>>>>>>>>>>> of a "poison pill" record that fails send(). > >>>>>>>>>>>> We've talked a lot about having a generic framework for > >>> all > >>>>>>> errors, > >>>>>>>>>> but I > >>>>>>>>>>>> don't think that is what this KIP is trying to do. > >>>>> Essentially > >>>>>>> the > >>>>>>>>>>> request > >>>>>>>>>>>> is to undo the change from KAFKA-9279 > >>>>>>>>>>>> <https://issues.apache.org/jira/browse/KAFKA-9279> but > >>>> under > >>>>>>>>> specific > >>>>>>>>>>>> circumstances that are controlled. I really am > >> concerned > >>>>> about > >>>>>>>>> opening > >>>>>>>>>>> new > >>>>>>>>>>>> avenues for bugs with EOS and hesitate to handle any > >>> other > >>>>>> types > >>>>>>> of > >>>>>>>>>>> errors. > >>>>>>>>>>>> I think if we all agree on the problem that we are > >> trying > >>>> to > >>>>>>> solve, > >>>>>>>>> it > >>>>>>>>>> is > >>>>>>>>>>>> easier to agree on solutions. > >>>>>>>>>>>> > >>>>>>>>>>>> Justine > >>>>>>>>>>>> > >>>>>>>>>>>> On Mon, Jul 1, 2024 at 2:20 AM Alieh Saeedi > >>>>>>>>>> <asae...@confluent.io.invalid > >>>>>>>>>>>> > >>>>>>>>>>>> wrote: > >>>>>>>>>>>> > >>>>>>>>>>>>> Hi Matthias, > >>>>>>>>>>>>> Thanks for the valid points you mentioned. I updated > >>> the > >>>>> KIP > >>>>>>> and > >>>>>>>>> the > >>>>>>>>>> PR > >>>>>>>>>>>>> with: > >>>>>>>>>>>>> 1) mentioning that the new overloaded `send` throws > >>>>>>>>>>>> `IllegalStateException` > >>>>>>>>>>>>> if the user tries to ignore `send()` errors outside > >> of > >>> a > >>>>>>>>> transaction. > >>>>>>>>>>>>> 2) the default implementation in `Producer` interface > >>>>> throws > >>>>>> an > >>>>>>>>>>>>> `UnsupportedOperationException` > >>>>>>>>>>>>> > >>>>>>>>>>>>> Hi Chris, > >>>>>>>>>>>>> Thanks for the feedback. I tried to clarify the > >> points > >>>> you > >>>>>>>> listed: > >>>>>>>>>>>>> -------> we've narrowed the scope from any error that > >>>> might > >>>>>>> take > >>>>>>>>>> place > >>>>>>>>>>>> with > >>>>>>>>>>>>> producing a record to Kafka, to only the ones that > >> are > >>>>> thrown > >>>>>>>>>> directly > >>>>>>>>>>>> from > >>>>>>>>>>>>> Producer::send; > >>>>>>>>>>>>> > >>>>>>>>>>>>> From the very beginning and even since KIP-1038, the > >>> main > >>>>>>> purpose > >>>>>>>>> was > >>>>>>>>>>> to > >>>>>>>>>>>>> have "more flexibility," or, in other words, "giving > >>> the > >>>>> user > >>>>>>> the > >>>>>>>>>>>>> authority" to handle some specific exceptions thrown > >>> from > >>>>> the > >>>>>>>>>>> `Producer`. > >>>>>>>>>>>>> Due to the specific cases we had in mind, KIP-1038 > >> was > >>>>>>> discarded > >>>>>>>>> and > >>>>>>>>>> we > >>>>>>>>>>>>> decided to not define a `CustomExceptionHandler` for > >>>>>> `Producer` > >>>>>>>> and > >>>>>>>>>>>> instead > >>>>>>>>>>>>> treat the `send` failures in a different way. The > >> main > >>>>> issue > >>>>>> is > >>>>>>>>> that > >>>>>>>>>>>> `send` > >>>>>>>>>>>>> makes a transition to error state, which is undoable. > >>> In > >>>>>> fact, > >>>>>>>> one > >>>>>>>>>>> single > >>>>>>>>>>>>> poison pill record makes the whole batch fail. The > >>> former > >>>>>>>>> suggestions > >>>>>>>>>>>> that > >>>>>>>>>>>>> you agreed with have been all about un-doing this > >>>>> transition > >>>>>> in > >>>>>>>>>> `flush` > >>>>>>>>>>>> or > >>>>>>>>>>>>> `commit`. The new suggestion is to un-do (or better, > >>> NOT > >>>>> do) > >>>>>> in > >>>>>>>>>> `send` > >>>>>>>>>>>> due > >>>>>>>>>>>>> to the reasons listed in the discussions above. > >>>>>>>>>>>>> Moreover, I would say that having such a large scope > >> as > >>>> you > >>>>>>>>> mentioned > >>>>>>>>>>> is > >>>>>>>>>>>>> impossible. In the best case, we may have control > >> over > >>>> the > >>>>>>>>>> `Producer`. > >>>>>>>>>>>> What > >>>>>>>>>>>>> shall we do with the broker? The `any error that > >> might > >>>> take > >>>>>>> place > >>>>>>>>>> with > >>>>>>>>>>>>> producing a record to Kafka` is too much, I think. > >>>>>>>>>>>>> > >>>>>>>>>>>>> -------> is this all we want to handle, and will it > >>>> prevent > >>>>>> us > >>>>>>>> from > >>>>>>>>>>>>> handling more in the future in an intuitive way? > >>>>>>>>>>>>> > >>>>>>>>>>>>> I think yes. This is all we want. Other sorts of > >> errors > >>>>> such > >>>>>> as > >>>>>>>>>> having > >>>>>>>>>>>>> problem with partition addition, producer fenced > >>>> exception, > >>>>>> etc > >>>>>>>>> seem > >>>>>>>>>> to > >>>>>>>>>>>> be > >>>>>>>>>>>>> more serious issues. The intention was to handle > >>> problems > >>>>>>> created > >>>>>>>>> by > >>>>>>>>>>>>> (maybe) a single poison pill record. BTW, I do not > >> see > >>>> any > >>>>>>>>> obstacles > >>>>>>>>>> to > >>>>>>>>>>>>> future changes. > >>>>>>>>>>>>> > >>>>>>>>>>>>> Bests, > >>>>>>>>>>>>> Alieh > >>>>>>>>>>>>> > >>>>>>>>>>>>> On Sat, Jun 29, 2024 at 3:03 AM Chris Egerton > >>>>>>>>>> <chr...@aiven.io.invalid > >>>>>>>>>>>> > >>>>>>>>>>>>> wrote: > >>>>>>>>>>>>> > >>>>>>>>>>>>>> Ah, sorry--spoke too soon. The PR doesn't show that > >>>>> errors > >>>>>>>> thrown > >>>>>>>>>>> from > >>>>>>>>>>>>>> Producer::send are handled, but instead, > >> ApiException > >>>>>>> instances > >>>>>>>>>> that > >>>>>>>>>>>> are > >>>>>>>>>>>>>> caught inside KafkaProducer::doSend and are handled > >>> by > >>>>>>>> returning > >>>>>>>>> an > >>>>>>>>>>>>>> already-failed future are. I think the same > >> question > >>>>> still > >>>>>>>>> applies > >>>>>>>>>>> (is > >>>>>>>>>>>>> this > >>>>>>>>>>>>>> all we want to handle, and will it prevent us from > >>>>> handling > >>>>>>>> more > >>>>>>>>> in > >>>>>>>>>>> the > >>>>>>>>>>>>>> future in an intuitive way), though. > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> On Fri, Jun 28, 2024 at 8:57 PM Chris Egerton < > >>>>>>> chr...@aiven.io > >>>>>>>>> > >>>>>>>>>>> wrote: > >>>>>>>>>>>>>> > >>>>>>>>>>>>>>> Hi Alieh, > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> This KIP has evolved a lot since I last looked at > >>> it, > >>>>> but > >>>>>>> the > >>>>>>>>>>> changes > >>>>>>>>>>>>>> seem > >>>>>>>>>>>>>>> well thought-out both in semantics and API. One > >>>>>> clarifying > >>>>>>>>>>> question I > >>>>>>>>>>>>>> have > >>>>>>>>>>>>>>> is that it looks based on the draft PR that we've > >>>>>> narrowed > >>>>>>>> the > >>>>>>>>>>> scope > >>>>>>>>>>>>> from > >>>>>>>>>>>>>>> any error that might take place with producing a > >>>> record > >>>>>> to > >>>>>>>>> Kafka, > >>>>>>>>>>> to > >>>>>>>>>>>>> only > >>>>>>>>>>>>>>> the ones that are thrown directly from > >>>> Producer::send; > >>>>> is > >>>>>>>> that > >>>>>>>>>> the > >>>>>>>>>>>>>> intended > >>>>>>>>>>>>>>> behavior here? And if so, do you have thoughts on > >>> how > >>>>> we > >>>>>>>> might > >>>>>>>>>>>> design a > >>>>>>>>>>>>>>> follow-up KIP that would catch all errors > >>> (including > >>>>> ones > >>>>>>>>>> reported > >>>>>>>>>>>>>>> asynchronously instead of synchronously)? I'd > >> like > >>> it > >>>>> if > >>>>>> we > >>>>>>>>> could > >>>>>>>>>>>> leave > >>>>>>>>>>>>>> the > >>>>>>>>>>>>>>> door open for that without painting ourselves > >> into > >>>> too > >>>>>> much > >>>>>>>> of > >>>>>>>>> a > >>>>>>>>>>>> corner > >>>>>>>>>>>>>>> with the API design for this KIP. > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> Cheers, > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> Chris > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> On Fri, Jun 28, 2024 at 6:31 PM Matthias J. Sax < > >>>>>>>>>> mj...@apache.org> > >>>>>>>>>>>>>> wrote: > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>> Thanks Alieh, > >>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>> it seems this KIP can just pick between a couple > >>> of > >>>>>>>> tradeoffs. > >>>>>>>>>>>> Adding > >>>>>>>>>>>>> an > >>>>>>>>>>>>>>>> overloaded `send()` as the KIP propose makes > >> sense > >>>> to > >>>>> me > >>>>>>> and > >>>>>>>>>> seems > >>>>>>>>>>>> to > >>>>>>>>>>>>>>>> provides the cleanest solution compare to there > >>>>> options > >>>>>> we > >>>>>>>>>>>> discussed. > >>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>> Given the explicit name of the passed-in option > >>> that > >>>>>>>>> highlights > >>>>>>>>>>> that > >>>>>>>>>>>>> the > >>>>>>>>>>>>>>>> option is for TX only make is pretty clear and > >>>> avoids > >>>>>> the > >>>>>>>>> issue > >>>>>>>>>> of > >>>>>>>>>>>>>>>> `flush()` ambiguity. > >>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>> Nit: We should make clear on the KIP though, > >> that > >>>> the > >>>>>> new > >>>>>>>>>> `send()` > >>>>>>>>>>>>>>>> overload would throw an `IllegalStateException` > >> if > >>>> TX > >>>>>> are > >>>>>>>> not > >>>>>>>>>> used > >>>>>>>>>>>>>>>> (similar to other TX methods like initTx(), etc) > >>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>> About the `Producer` interface, I am not sure > >> how > >>>> this > >>>>>> was > >>>>>>>>> done > >>>>>>>>>> in > >>>>>>>>>>>> the > >>>>>>>>>>>>>>>> past (eg, KIP-266 added > >> `Consumer.poll(Duration)` > >>>> w/o > >>>>> a > >>>>>>>>> default > >>>>>>>>>>>>>>>> implementation), if we need a default > >>> implementation > >>>>> for > >>>>>>>>>> backward > >>>>>>>>>>>>>>>> compatibility or not? If we do want to add one, > >> I > >>>>> think > >>>>>> it > >>>>>>>>> would > >>>>>>>>>>> be > >>>>>>>>>>>>>>>> appropriate to throw an > >>>>> `UnsupportedOperationException` > >>>>>> by > >>>>>>>>>>> default, > >>>>>>>>>>>>>>>> instead of just keeping the default impl empty? > >>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>> My points are rather minor, and should not block > >>>> this > >>>>>> KIP > >>>>>>>>>> though. > >>>>>>>>>>>>>>>> Overall LGTM. > >>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>> -Matthias > >>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>> On 6/27/24 1:28 PM, Alieh Saeedi wrote: > >>>>>>>>>>>>>>>>> Hi Justine, > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> Thanks for the suggestion. > >>>>>>>>>>>>>>>>> Making applications to validate every single > >>>> record > >>>>> is > >>>>>>> not > >>>>>>>>> the > >>>>>>>>>>>> best > >>>>>>>>>>>>>> way, > >>>>>>>>>>>>>>>>> from an efficiency point of view. > >>>>>>>>>>>>>>>>> Moreover, between changing the behavior of the > >>>>>> Producer > >>>>>>> in > >>>>>>>>>>> `send` > >>>>>>>>>>>>> and > >>>>>>>>>>>>>>>>> `commitTnx`, the former seems more reasonable > >>> and > >>>>>> clean. > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> Bests, > >>>>>>>>>>>>>>>>> Alieh > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> On Thu, Jun 27, 2024 at 8:14 PM Justine Olshan > >>>>>>>>>>>>>>>> <jols...@confluent.io.invalid> > >>>>>>>>>>>>>>>>> wrote: > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>> Hey Alieh, > >>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>> I see there are two options now. So folks > >> will > >>> be > >>>>>>>>> discussing > >>>>>>>>>>> the > >>>>>>>>>>>>>>>> approaches > >>>>>>>>>>>>>>>>>> and deciding the best way forward before we > >>> vote? > >>>>>>>>>>>>>>>>>> I do think there could be a problem with the > >>>>> approach > >>>>>>> on > >>>>>>>>>> commit > >>>>>>>>>>>> if > >>>>>>>>>>>>> we > >>>>>>>>>>>>>>>> get > >>>>>>>>>>>>>>>>>> stuck on an earlier error and have more > >> records > >>>>>>>>> (potentially > >>>>>>>>>> on > >>>>>>>>>>>> new > >>>>>>>>>>>>>>>>>> partitions) to commit as the current PR is > >>>>>> implemented. > >>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>> I guess this takes us back to the question of > >>>>> whether > >>>>>>> the > >>>>>>>>>> error > >>>>>>>>>>>>>> should > >>>>>>>>>>>>>>>> be > >>>>>>>>>>>>>>>>>> cleared on send. > >>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>> (And I guess at the back of my mind, I'm > >>>> wondering > >>>>> if > >>>>>>>> there > >>>>>>>>>> is > >>>>>>>>>>> a > >>>>>>>>>>>>> way > >>>>>>>>>>>>>>>> we can > >>>>>>>>>>>>>>>>>> validate the "posion pill" records > >> application > >>>> side > >>>>>>>> before > >>>>>>>>> we > >>>>>>>>>>>> even > >>>>>>>>>>>>>> try > >>>>>>>>>>>>>>>> to > >>>>>>>>>>>>>>>>>> send them) > >>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>> Justine > >>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>> On Wed, Jun 26, 2024 at 4:38 PM Alieh Saeedi > >>>>>>>>>>>>>>>> <asae...@confluent.io.invalid > >>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>> wrote: > >>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>> Hi Justine, > >>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>> I did not update the KIP with > >> `TxnSendOption` > >>>>> since > >>>>>> I > >>>>>>>>>> thought > >>>>>>>>>>>> it'd > >>>>>>>>>>>>>> be > >>>>>>>>>>>>>>>>>>> better discussed here beforehand. > >>>>>>>>>>>>>>>>>>> right now, there are 2 PRs: > >>>>>>>>>>>>>>>>>>> - the PR that implements the current version > >>> of > >>>>> the > >>>>>>> KIP: > >>>>>>>>>>>>>>>>>>> https://github.com/apache/kafka/pull/16332 > >>>>>>>>>>>>>>>>>>> - the POC PR that clarifies the > >>> `TxnSendOption`: > >>>>>>>>>>>>>>>>>>> https://github.com/apache/kafka/pull/16465 > >>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>> Bests, > >>>>>>>>>>>>>>>>>>> Alieh > >>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>> On Thu, Jun 27, 2024 at 12:42 AM Justine > >>> Olshan > >>>>>>>>>>>>>>>>>>> <jols...@confluent.io.invalid> wrote: > >>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>> Hey Alieh, > >>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>> I think I am a little confused. Are the 3 > >>>> points > >>>>>>> above > >>>>>>>>>>>> addressed > >>>>>>>>>>>>> by > >>>>>>>>>>>>>>>> the > >>>>>>>>>>>>>>>>>>> KIP > >>>>>>>>>>>>>>>>>>>> or did something change? The PR seems to > >> not > >>>>>> include > >>>>>>>> this > >>>>>>>>>>>> change > >>>>>>>>>>>>>> and > >>>>>>>>>>>>>>>>>>> still > >>>>>>>>>>>>>>>>>>>> has the CommitOption as well. > >>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>> Thanks, > >>>>>>>>>>>>>>>>>>>> Justine > >>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>> On Wed, Jun 26, 2024 at 2:15 PM Alieh > >> Saeedi > >>>>>>>>>>>>>>>>>>> <asae...@confluent.io.invalid > >>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>> wrote: > >>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>> Hi all, > >>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>> Looking at the PR < > >>>>>>>>>>> https://github.com/apache/kafka/pull/16332 > >>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>> corresponding to the KIP, there are some > >>>> points > >>>>>>> worthy > >>>>>>>>> of > >>>>>>>>>>>>> mention: > >>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>> 1) clearing the error ends up dirty/messy > >>> code > >>>>> in > >>>>>>>>>>>>>>>>>> `TransactionManager`. > >>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>> 2) By clearing the error, we are actually > >>>> doing > >>>>> an > >>>>>>>>> illegal > >>>>>>>>>>>>>>>> transition > >>>>>>>>>>>>>>>>>>>> from > >>>>>>>>>>>>>>>>>>>>> `ABORTABLE_ERROR` to `IN_TRANSACTION` > >> which > >>> is > >>>>>>>>>> conceptually > >>>>>>>>>>>> not > >>>>>>>>>>>>>>>>>>>> acceptable. > >>>>>>>>>>>>>>>>>>>>> This can be the root cause of some issues, > >>>> with > >>>>>>>> perhaps > >>>>>>>>>>>> further > >>>>>>>>>>>>>>>>>> future > >>>>>>>>>>>>>>>>>>>>> changes by others. > >>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>> 3) If the poison pill record `r1` causes a > >>>>>>> transition > >>>>>>>> to > >>>>>>>>>> the > >>>>>>>>>>>>> error > >>>>>>>>>>>>>>>>>>> state > >>>>>>>>>>>>>>>>>>>>> and then the next record `r2` requires > >>> adding > >>>> a > >>>>>>>>> partition > >>>>>>>>>> to > >>>>>>>>>>>> the > >>>>>>>>>>>>>>>>>>>>> transaction, the action fails due to being > >>> in > >>>>> the > >>>>>>>> error > >>>>>>>>>>> state. > >>>>>>>>>>>>> In > >>>>>>>>>>>>>>>>>> this > >>>>>>>>>>>>>>>>>>>>> case, clearing errors during > >>>>>>>>> `commitTnx(CLEAR_SEND_ERROR)` > >>>>>>>>>>> is > >>>>>>>>>>>>> too > >>>>>>>>>>>>>>>>>> late. > >>>>>>>>>>>>>>>>>>>>> However, this case can NOT be the main > >>> concern > >>>>> as > >>>>>>> soon > >>>>>>>>> as > >>>>>>>>>>>>> KIP-890 > >>>>>>>>>>>>>> is > >>>>>>>>>>>>>>>>>>>> fully > >>>>>>>>>>>>>>>>>>>>> implemented. > >>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>> My suggestion is to solve the problem > >> where > >>> it > >>>>>>> arises. > >>>>>>>>> If > >>>>>>>>>>> the > >>>>>>>>>>>>>>>>>>> transition > >>>>>>>>>>>>>>>>>>>> to > >>>>>>>>>>>>>>>>>>>>> the error state does not happen during > >>>> `send()`, > >>>>>> we > >>>>>>> do > >>>>>>>>> not > >>>>>>>>>>>> need > >>>>>>>>>>>>> to > >>>>>>>>>>>>>>>>>>> clear > >>>>>>>>>>>>>>>>>>>>> the error later. Therefore, instead of > >>>>>>> `CommitOption`, > >>>>>>>>> we > >>>>>>>>>>> can > >>>>>>>>>>>>>> define > >>>>>>>>>>>>>>>>>> a > >>>>>>>>>>>>>>>>>>>>> `TxnSendOption` and prevent the `send()` > >>>> method > >>>>>> from > >>>>>>>>> going > >>>>>>>>>>> to > >>>>>>>>>>>>> the > >>>>>>>>>>>>>>>>>> error > >>>>>>>>>>>>>>>>>>>>> state in case 1) we're in a transaction > >> and > >>> 2) > >>>>> the > >>>>>>>> user > >>>>>>>>>>> asked > >>>>>>>>>>>>> for > >>>>>>>>>>>>>>>>>>>>> IGONRE_SEND_ERRORS. For more clarity, you > >>> can > >>>>>> take a > >>>>>>>>> look > >>>>>>>>>> at > >>>>>>>>>>>> the > >>>>>>>>>>>>>> POC > >>>>>>>>>>>>>>>>>> PR > >>>>>>>>>>>>>>>>>>>>> < > >> https://github.com/apache/kafka/pull/16465 > >>>> . > >>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>> Cheers, > >>>>>>>>>>>>>>>>>>>>> Alieh > >>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>> > >>>>>>>>>>>>> > >>>>>>>>>>>> > >>>>>>>>>>> > >>>>>>>>>> > >>>>>>>>> > >>>>>>>> > >>>>>>> > >>>>>> > >>>>> > >>>> > >>> > >> > >