Hi, Justine,

Thanks for the KIP. A couple of comments.

70. Currently, the producer epoch is an int. I am not sure if it's enough
to accommodate all transactions in the lifetime of a producer. Should we
change that to a long or add a new long field like txnId?

71. "it will write the prepare commit message with a bumped epoch and send
WriteTxnMarkerRequests with the bumped epoch." Hmm, the epoch is associated
with the current txn, right? So it seems weird to write a commit message
with a bumped epoch. Should we only bump up the epoch in EndTxnResponse and
rename the field to something like nextProducerEpoch?
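To sketch what I mean (purely illustrative -- the method and field names
below are not the actual client internals, and nextProducerEpoch is the
proposed new field): the markers would carry the current epoch, and the
producer would adopt the bumped epoch only from the EndTxnResponse.

    // Purely illustrative sketch, not actual client code.
    EndTxnResponse response = endTransaction(producerId, currentEpoch);
    if (response.error() == Errors.NONE) {
        // Markers were written with the current epoch; only now does the
        // producer adopt the bumped epoch for its next transaction.
        currentEpoch = response.nextProducerEpoch(); // proposed field
    }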
Thanks,

Jun

On Mon, Dec 12, 2022 at 8:54 PM Matthias J. Sax <mj...@apache.org> wrote:

> Thanks for the background.
>
> 20/30: SGTM. My proposal was only focused on avoiding dangling
> transactions if records are added without a registered partition. -- Maybe
> you can add a few more details to the KIP about this scenario for better
> documentation purposes?
>
> 40: I think you hit a fair point about race conditions or client bugs
> (incorrectly not bumping the epoch). The complexity/confusion I see in
> using the bumped epoch is mainly for internal debugging, ie, inspecting
> log segment dumps -- it seems harder for us humans to reason about the
> system. But if we get better guarantees, it would be worth using the
> bumped epoch.
>
> 60: as I mentioned already, I don't know the broker internals well enough
> to provide more input. So if nobody else chimes in, we should just move
> forward with your proposal.
>
>
> -Matthias
>
>
> On 12/6/22 4:22 PM, Justine Olshan wrote:
> > Hi all,
> > After Artem's questions about error behavior, I've re-evaluated the
> > unknown producer ID exception and had some discussions offline.
> >
> > I think generally it makes sense to simplify error handling in cases
> > like this, and the UNKNOWN_PRODUCER_ID error has a pretty long and
> > complicated history. Because of this, I propose adding a new error code
> > ABORTABLE_ERROR that, when encountered by new clients (gated by the
> > produce request version), will simply abort the transaction. This
> > allows the server to have some say in whether the client aborts and
> > makes handling much simpler. In the future, we can also use this error
> > in other situations where we want to abort the transaction. We can even
> > use it on other APIs.
> >
> > I've added this to the KIP. Let me know if there are any questions or
> > issues.
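> >
> > As a rough sketch (the method name here is illustrative of the
> > proposal, not existing producer internals), the new client logic
> > would be:
> >
> > if (error == Errors.ABORTABLE_ERROR) {
> >     // The broker has asked us to abort; surface an abortable error so
> >     // the application calls abortTransaction() and retries.
> >     transitionToAbortableError(new KafkaException(
> >         "The broker requested that the transaction be aborted"));
> >     return;
> > }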
> >
> > Justine
> >
> > On Fri, Dec 2, 2022 at 10:22 AM Justine Olshan <jols...@confluent.io>
> > wrote:
> >
> >> Hey Matthias,
> >>
> >> 20/30 — Maybe I also didn't express myself clearly. For older clients we
> >> don't have a way to distinguish between a previous and the current
> >> transaction since we don't have the epoch bump. This means that a late
> >> message from the previous transaction may be added to the new one. With
> >> older clients — we can't guarantee this won't happen if we already sent
> >> the addPartitionsToTxn call (why we make changes for the newer client)
> >> but we can at least gate some by ensuring that the partition has been
> >> added to the transaction. The rationale here is that there are likely
> >> FEWER late arrivals as time goes on, so hopefully most late arrivals will
> >> come in BEFORE the addPartitionsToTxn call. Those that arrive before will
> >> be properly gated with the describeTransactions approach.
> >>
> >> If we take the approach you suggested, ANY late arrival from a previous
> >> transaction will be added. And we don't want that. I also don't see any
> >> benefit in sending addPartitionsToTxn over the describeTxns call. They
> >> will both be one extra RPC to the Txn coordinator.
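> >>
> >> To make the gating concrete, the old-client check on the leader could
> >> look roughly like this (names are illustrative, not the actual broker
> >> code):
> >>
> >> // Before appending a transactional batch from an old client, confirm
> >> // with the transaction coordinator that the partition was added to
> >> // the ongoing transaction.
> >> boolean ongoing = describeTransactions(transactionalId)
> >>     .partitions().contains(topicPartition);
> >> if (!ongoing) {
> >>     // Late message from a previous transaction: reject it instead of
> >>     // letting it hang or leak into the next transaction.
> >>     return Errors.INVALID_TXN_STATE;
> >> }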
> >>
> >> To be clear — newer clients will use addPartitionsToTxn instead of the
> >> DescribeTxns.
> >>
> >>
> >> 40)
> >> My concern is that if we have some delay in the client to bump the epoch,
> >> it could continue to send epoch 73 and those records would not be fenced.
> >> Perhaps this is not an issue if we don't allow the next produce to go
> >> through before the EndTxn request returns. I'm also thinking about cases
> >> of failure. I will need to think on this a bit.
> >>
> >> I wasn't sure if it was that confusing. But if we think it is, we can
> >> investigate other ways.
> >>
> >>
> >> 60)
> >> I'm not sure these are the same purgatories since one is a produce
> >> purgatory (I was planning on using a callback rather than purgatory) and
> >> the other is simply a request to append to the log. Not sure we have any
> >> structure here for ordering, but my understanding is that the broker
> >> could handle the write request before it hears back from the Txn
> >> Coordinator.
> >>
> >> Let me know if I misunderstood something or something was unclear.
> >>
> >> Justine
> >>
> >> On Thu, Dec 1, 2022 at 12:15 PM Matthias J. Sax <mj...@apache.org>
> >> wrote:
> >>
> >>> Thanks for the details Justine!
> >>>
> >>>> 20)
> >>>>
> >>>> The client side change for 2 is removing the addPartitions to
> >>>> transaction call. We don't need to make this from the producer to the
> >>>> txn coordinator, only server side.
> >>>
> >>> I think I did not express myself clearly. I understand that we can (and
> >>> should) change the producer to not send the `addPartitions` request any
> >>> longer. But I don't think it's a requirement to change the broker?
> >>>
> >>> What I am trying to say is: as a safe-guard and improvement for older
> >>> producers, the partition leader can just send the `addPartitions`
> >>> request to the TX-coordinator in any case -- if the old producer
> >>> correctly did send the `addPartition` request to the TX-coordinator
> >>> already, the TX-coordinator can just "ignore" it as idempotent. However,
> >>> if the old producer has a bug and did forget to send the `addPartition`
> >>> request, we would now ensure that the partition is indeed added to the
> >>> TX and thus fix a potential producer bug (even if we don't get the
> >>> fencing via the bumped epoch). -- It seems to be a good improvement? Or
> >>> is there a reason to not do this?
> >>>
> >>>
> >>>> 30)
> >>>>
> >>>> Transaction is ongoing = partition was added to transaction via
> >>>> addPartitionsToTxn. We check this with the DescribeTransactions call.
> >>>> Let me know if this wasn't sufficiently explained here:
> >>>
> >>> If we do what I propose in (20), we don't really need to make this
> >>> `DescribeTransaction` call, as the partition leader adds the partition
> >>> for older clients and we get this check for free.
> >>>
> >>>
> >>>> 40)
> >>>>
> >>>> The idea here is that if any messages somehow come in before we get the
> >>>> new epoch to the producer, they will be fenced. However, if we don't
> >>>> think this is necessary, it can be discussed.
> >>>
> >>> I agree that we should have epoch fencing. My question is different:
> >>> Assume we are at epoch 73, and we have an ongoing transaction that is
> >>> committed. It seems natural to write the "prepare commit" marker and the
> >>> `WriteTxMarkerRequest` both with epoch 73, too, as it belongs to the
> >>> current transaction. Of course, we now also bump the epoch and expect
> >>> the next requests to have epoch 74, and would reject a request with
> >>> epoch 73, as the corresponding TX for epoch 73 was already committed.
> >>>
> >>> It seems you propose to write the "prepare commit marker" and
> >>> `WriteTxMarkerRequest` with epoch 74 though, which would work, but it
> >>> seems confusing. Is there a reason why we would use the bumped epoch 74
> >>> instead of the current epoch 73?
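> >>>
> >>> Just to illustrate the fencing I have in mind (a sketch only, not
> >>> actual coordinator code):
> >>>
> >>> // After the epoch-73 transaction is committed, the coordinator is at
> >>> // epoch 74, so a delayed request still carrying epoch 73 is fenced.
> >>> if (requestEpoch < currentEpoch) {
> >>>     return Errors.PRODUCER_FENCED;
> >>> }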
> >>>
> >>>
> >>>> 60)
> >>>>
> >>>> When we are checking if the transaction is ongoing, we need to make a
> >>>> round trip from the leader partition to the transaction coordinator. In
> >>>> the time we are waiting for this message to come back, in theory we
> >>>> could have sent a commit/abort call that would make the original result
> >>>> of the check out of date. That is why we can check the leader state
> >>>> before we write to the log.
> >>>
> >>> Thanks. Got it.
> >>>
> >>> However, is this really an issue? We put the produce request in
> >>> purgatory, so how could we process the `WriteTxnMarkerRequest` first?
> >>> Don't we need to put the `WriteTxnMarkerRequest` into purgatory, too,
> >>> for this case, and process both requests in order? (Again, my broker
> >>> knowledge is limited and maybe we don't maintain request order for this
> >>> case, which seems to be an issue IMHO, and I am wondering if changing
> >>> request handling to preserve order for this case might be the cleaner
> >>> solution?)
> >>>
> >>>
> >>> -Matthias
> >>>
> >>>
> >>> On 11/30/22 3:28 PM, Artem Livshits wrote:
> >>>> Hi Justine,
> >>>>
> >>>> I think the interesting part is not in this logic (because it tries to
> >>>> figure out when UNKNOWN_PRODUCER_ID is retriable, and if it's
> >>>> retriable, it's definitely not fatal), but what happens when this logic
> >>>> doesn't return 'true' and falls through. In the old clients it seems to
> >>>> be fatal, and if we keep the behavior in the new clients, I'd expect it
> >>>> would be fatal as well.
> >>>>
> >>>> -Artem
> >>>>
> >>>> On Tue, Nov 29, 2022 at 11:57 AM Justine Olshan
> >>>> <jols...@confluent.io.invalid> wrote:
> >>>>
> >>>>> Hi Artem and Jeff,
> >>>>>
> >>>>> Thanks for taking a look and sorry for the slow response.
> >>>>>
> >>>>> You both mentioned the change to handle UNKNOWN_PRODUCER_ID errors. To
> >>>>> be clear — this error code will only be sent again when the client's
> >>>>> request version is high enough to ensure we handle it correctly.
> >>>>> The current (Java) client handles this by the following (somewhat
> >>>>> long) code snippet:
> >>>>>
> >>>>> // An UNKNOWN_PRODUCER_ID means that we have lost the producer state
> >>>>> // on the broker. Depending on the log start offset, we may want to
> >>>>> // retry these, as described for each case below. If none of those
> >>>>> // apply, then for the idempotent producer, we will locally bump the
> >>>>> // epoch and reset the sequence numbers of in-flight batches from
> >>>>> // sequence 0, then retry the failed batch, which should now succeed.
> >>>>> // For the transactional producer, allow the batch to fail. When
> >>>>> // processing the failed batch, we will transition to an abortable
> >>>>> // error and set a flag indicating that we need to bump the epoch
> >>>>> // (if supported by the broker).
> >>>>> if (error == Errors.UNKNOWN_PRODUCER_ID) {
> >>>>>     if (response.logStartOffset == -1) {
> >>>>>         // We don't know the log start offset with this response. We
> >>>>>         // should just retry the request until we get it.
> >>>>>         // The UNKNOWN_PRODUCER_ID error code was added along with
> >>>>>         // the new ProduceResponse which includes the logStartOffset.
> >>>>>         // So the '-1' sentinel is not for backward compatibility.
> >>>>>         // Instead, it is possible for a broker to not know the
> >>>>>         // logStartOffset when it is returning the response because
> >>>>>         // the partition may have moved away from the broker from the
> >>>>>         // time the error was initially raised to the time the
> >>>>>         // response was being constructed. In these cases, we should
> >>>>>         // just retry the request: we are guaranteed to eventually
> >>>>>         // get a logStartOffset once things settle down.
> >>>>>         return true;
> >>>>>     }
> >>>>>
> >>>>>     if (batch.sequenceHasBeenReset()) {
> >>>>>         // When the first in-flight batch fails due to the truncation
> >>>>>         // case, then the sequences of all the other in-flight batches
> >>>>>         // would have been restarted from the beginning. However, when
> >>>>>         // those responses come back from the broker, they would also
> >>>>>         // come with an UNKNOWN_PRODUCER_ID error. In this case, we
> >>>>>         // should not reset the sequence numbers to the beginning.
> >>>>>         return true;
> >>>>>     } else if (lastAckedOffset(batch.topicPartition).orElse(
> >>>>>             NO_LAST_ACKED_SEQUENCE_NUMBER) < response.logStartOffset) {
> >>>>>         // The head of the log has been removed, probably due to the
> >>>>>         // retention time elapsing. In this case, we expect to lose
> >>>>>         // the producer state. For the transactional producer, reset
> >>>>>         // the sequences of all inflight batches to be from the
> >>>>>         // beginning and retry them, so that the transaction does not
> >>>>>         // need to be aborted. For the idempotent producer, bump the
> >>>>>         // epoch to avoid reusing (sequence, epoch) pairs.
> >>>>>         if (isTransactional()) {
> >>>>>             txnPartitionMap.startSequencesAtBeginning(
> >>>>>                 batch.topicPartition, this.producerIdAndEpoch);
> >>>>>         } else {
> >>>>>             requestEpochBumpForPartition(batch.topicPartition);
> >>>>>         }
> >>>>>         return true;
> >>>>>     }
> >>>>>
> >>>>>     if (!isTransactional()) {
> >>>>>         // For the idempotent producer, always retry
> >>>>>         // UNKNOWN_PRODUCER_ID errors. If the batch has the current
> >>>>>         // producer ID and epoch, request a bump of the epoch.
> >>>>>         // Otherwise just retry the produce.
> >>>>>         requestEpochBumpForPartition(batch.topicPartition);
> >>>>>         return true;
> >>>>>     }
> >>>>> }
> >>>>>
> >>>>> I was considering keeping this behavior — but am open to simplifying
> >>>>> it.
> >>>>>
> >>>>> We are leaving changes to older clients off the table here since it
> >>>>> caused many issues for clients in the past. Previously this was a
> >>>>> fatal error and we didn't have the mechanisms in place to detect when
> >>>>> this was a legitimate case vs some bug or gap in the protocol.
> >>>>> Ensuring each transaction has its own epoch should close this gap.
> >>>>>
> >>>>>
> >>>>> And to address Jeff's second point:
> >>>>> *Does the typical produce request path append records to the local*
> >>>>> *log along with the currentTxnFirstOffset information? I would like*
> >>>>> *to understand when the field is written to disk.*
> >>>>>
> >>>>> Yes, the first produce request populates this field and writes the
> >>>>> offset as part of the record batch and also to the producer state
> >>>>> snapshot. When we reload the records on restart and/or reassignment,
> >>>>> we repopulate this field with the snapshot from disk along with the
> >>>>> rest of the producer state.
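> >>>>>
> >>>>> Roughly, the per-producer state on the leader looks like this (a
> >>>>> simplified sketch, not the exact broker classes):
> >>>>>
> >>>>> // Illustrative sketch of the state the leader keeps per producer.
> >>>>> class ProducerStateEntry {
> >>>>>     long producerId;
> >>>>>     short producerEpoch;
> >>>>>     // Set by the first produce request of a transaction and
> >>>>>     // persisted in the producer state snapshot, so it survives
> >>>>>     // restarts and reassignment; cleared once the transaction
> >>>>>     // marker is written.
> >>>>>     java.util.OptionalLong currentTxnFirstOffset;
> >>>>> }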
> >>>>>
> >>>>> Let me know if there are further comments and/or questions.
> >>>>>
> >>>>> Thanks,
> >>>>> Justine
> >>>>>
> >>>>> On Tue, Nov 22, 2022 at 9:00 PM Jeff Kim
> >>>>> <jeff....@confluent.io.invalid> wrote:
> >>>>>
> >>>>>> Hi Justine,
> >>>>>>
> >>>>>> Thanks for the KIP! I have two questions:
> >>>>>>
> >>>>>> 1) For new clients, we can once again return an error
> >>>>>> UNKNOWN_PRODUCER_ID for sequences that are non-zero when there is no
> >>>>>> producer state present on the server. This will indicate we missed
> >>>>>> the 0 sequence and we don't yet want to write to the log.
> >>>>>>
> >>>>>> I would like to understand the current behavior to handle older
> >>>>>> clients, and if there are any changes we are making. Maybe I'm
> >>>>>> missing something, but we would want to identify whether we missed
> >>>>>> the 0 sequence for older clients, no?
> >>>>>>
> >>>>>> 2) Upon returning from the transaction coordinator, we can set the
> >>>>>> transaction as ongoing on the leader by populating
> >>>>>> currentTxnFirstOffset through the typical produce request handling.
> >>>>>>
> >>>>>> Does the typical produce request path append records to the local
> >>>>>> log along with the currentTxnFirstOffset information? I would like
> >>>>>> to understand when the field is written to disk.
> >>>>>>
> >>>>>> Thanks,
> >>>>>> Jeff
> >>>>>>
> >>>>>>
> >>>>>> On Tue, Nov 22, 2022 at 4:44 PM Artem Livshits
> >>>>>> <alivsh...@confluent.io.invalid> wrote:
> >>>>>>
> >>>>>>> Hi Justine,
> >>>>>>>
> >>>>>>> Thank you for the KIP. I have one question.
> >>>>>>>
> >>>>>>> 5) For new clients, we can once again return an error
> >>>>>>> UNKNOWN_PRODUCER_ID
> >>>>>>>
> >>>>>>> I believe we had problems in the past with returning
> >>>>>>> UNKNOWN_PRODUCER_ID because it was considered fatal and required
> >>>>>>> client restart. It would be good to spell out the new client
> >>>>>>> behavior when it receives the error.
> >>>>>>>
> >>>>>>> -Artem
> >>>>>>>
> >>>>>>> On Tue, Nov 22, 2022 at 10:00 AM Justine Olshan
> >>>>>>> <jols...@confluent.io.invalid> wrote:
> >>>>>>>
> >>>>>>>> Thanks for taking a look Matthias. I've tried to answer your
> >>>>>>>> questions below:
> >>>>>>>>
> >>>>>>>> 10)
> >>>>>>>>
> >>>>>>>> Right — so the hanging transaction only occurs when we have a late
> >>>>>>>> message come in and the partition is never added to a transaction
> >>>>>>>> again.
> >>>>>>>> If we never add the partition to a transaction, we will never
> >>>>>>>> write a marker and never advance the LSO.
> >>>>>>>>
> >>>>>>>> If we do end up adding the partition to the transaction (I suppose
> >>>>>>>> this can happen before or after the late message comes in) then we
> >>>>>>>> will include the late message in the next (incorrect) transaction.
> >>>>>>>>
> >>>>>>>> So perhaps it is clearer to make the distinction between messages
> >>>>>>>> that eventually get added to the transaction (but the wrong one)
> >>>>>>>> or messages that never get added and become hanging.
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> 20)
> >>>>>>>>
> >>>>>>>> The client side change for 2 is removing the addPartitions to
> >>>>>>>> transaction call. We don't need to make this from the producer to
> >>>>>>>> the txn coordinator, only server side.
> >>>>>>>>
> >>>>>>>> In my opinion, the issue with the addPartitionsToTxn call for
> >>>>>>>> older clients is that we don't have the epoch bump, so we don't
> >>>>>>>> know if the message belongs to the previous transaction or this
> >>>>>>>> one. We need to check if the partition has been added to this
> >>>>>>>> transaction. Of course, this means we won't completely cover the
> >>>>>>>> case where we have a really late message and we have added the
> >>>>>>>> partition to the new transaction, but that's unfortunately
> >>>>>>>> something we will need the new clients to cover.
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> 30)
> >>>>>>>>
> >>>>>>>> Transaction is ongoing = partition was added to transaction via
> >>>>>>>> addPartitionsToTxn. We check this with the DescribeTransactions
> >>>>>>>> call. Let me know if this wasn't sufficiently explained here:
> >>>>>>>>
> >>>>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-890%3A+Transactions+Server-Side+Defense#KIP890:TransactionsServerSideDefense-EnsureOngoingTransactionforOlderClients(3)
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> 40)
> >>>>>>>>
> >>>>>>>> The idea here is that if any messages somehow come in before we
> >>>>>>>> get the new epoch to the producer, they will be fenced. However,
> >>>>>>>> if we don't think this is necessary, it can be discussed.
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> 50)
> >>>>>>>>
> >>>>>>>> It should be synchronous because if we have an event (ie, an
> >>>>>>>> error) that causes us to need to abort the transaction, we need to
> >>>>>>>> know which partitions to send transaction markers to. We know the
> >>>>>>>> partitions because we added them to the coordinator via the
> >>>>>>>> addPartitionsToTxn call. We have had asynchronous calls in the
> >>>>>>>> past (ie, writing the commit markers when the transaction is
> >>>>>>>> completed) but often this just causes confusion, as we need to
> >>>>>>>> wait for some operations to complete. In the writing commit
> >>>>>>>> markers case, clients often see CONCURRENT_TRANSACTIONS error
> >>>>>>>> messages and that can be confusing. For that reason, it may be
> >>>>>>>> simpler to just have synchronous calls — especially if we need to
> >>>>>>>> block on some operation's completion anyway before we can start
> >>>>>>>> the next transaction. And yes, I meant coordinator. I will fix
> >>>>>>>> that.
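> >>>>>>>>
> >>>>>>>> As a sketch of the ordering this implies (illustrative names, not
> >>>>>>>> the actual producer internals):
> >>>>>>>>
> >>>>>>>> // The implicit add must complete before the records are written,
> >>>>>>>> // so that on an abort the coordinator already knows every
> >>>>>>>> // partition that needs a transaction marker.
> >>>>>>>> addPartitionsToTxn(transactionalId, partitions).get(); // blocks
> >>>>>>>> appendRecords(partitions, records); // only now is it safe to write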
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> 60)
> >>>>>>>>
> >>>>>>>> When we are checking if the transaction is ongoing, we need to
> >>>>>>>> make a round trip from the leader partition to the transaction
> >>>>>>>> coordinator. In the time we are waiting for this message to come
> >>>>>>>> back, in theory we could have sent a commit/abort call that would
> >>>>>>>> make the original result of the check out of date. That is why we
> >>>>>>>> can check the leader state before we write to the log.
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> I'm happy to update the KIP if some of these things were not
> >>>>>>>> clear.
> >>>>>>>> Thanks,
> >>>>>>>> Justine
> >>>>>>>>
> >>>>>>>> On Mon, Nov 21, 2022 at 7:11 PM Matthias J. Sax <mj...@apache.org>
> >>>>>>>> wrote:
> >>>>>>>>
> >>>>>>>>> Thanks for the KIP.
> >>>>>>>>>
> >>>>>>>>> Couple of clarification questions (I am not a broker expert, so
> >>>>>>>>> maybe some questions are obvious for others, but not for me with
> >>>>>>>>> my lack of broker knowledge).
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> (10)
> >>>>>>>>>
> >>>>>>>>>> The delayed message case can also violate EOS if the delayed
> >>>>>>>>>> message comes in after the next addPartitionsToTxn request comes
> >>>>>>>>>> in. Effectively we may see a message from a previous (aborted)
> >>>>>>>>>> transaction become part of the next transaction.
> >>>>>>>>>
> >>>>>>>>> What happens if the message comes in before the next
> >>>>>>>>> addPartitionsToTxn request? It seems the broker hosting the data
> >>>>>>>>> partitions won't know anything about it and append it to the
> >>>>>>>>> partition, too? What is the difference between both cases?
> >>>>>>>>>
> >>>>>>>>> Also, it seems a TX would only hang if there is no following TX
> >>>>>>>>> that is either committed or aborted? Thus, for the case above,
> >>>>>>>>> the TX might actually not hang (of course, we might get an EOS
> >>>>>>>>> violation if the first TX was aborted and the second committed,
> >>>>>>>>> or the other way around).
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> (20)
> >>>>>>>>>
> >>>>>>>>>> Of course, 1 and 2 require client-side changes, so for older
> >>>>>>>>>> clients, those approaches won’t apply.
> >>>>>>>>>
> >>>>>>>>> For (1) I understand why a client change is necessary, but not
> >>>>>>>>> sure why we need a client change for (2). Can you elaborate? --
> >>>>>>>>> Later you explain that we should send a
> >>>>>>>>> DescribeTransactionRequest, but I am not sure why? Can't we just
> >>>>>>>>> do an implicit AddPartitionToTx, too? If the old producer
> >>>>>>>>> correctly registered the partition already, the TX-coordinator
> >>>>>>>>> can just ignore it as it's an idempotent operation?
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> (30)
> >>>>>>>>>
> >>>>>>>>>> To cover older clients, we will ensure a transaction is ongoing
> >>>>>>>>>> before we write to a transaction
> >>>>>>>>>
> >>>>>>>>> Not sure what you mean by this? Can you elaborate?
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> (40)
> >>>>>>>>>
> >>>>>>>>>> [the TX-coordinator] will write the prepare commit message with
> >>>>>>>>>> a bumped epoch and send WriteTxnMarkerRequests with the bumped
> >>>>>>>>>> epoch.
> >>>>>>>>>
> >>>>>>>>> Why do we use the bumped epoch for both? It seems more intuitive
> >>>>>>>>> to use the current epoch, and only return the bumped epoch to the
> >>>>>>>>> producer?
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> (50) "Implicit AddPartitionToTransaction"
> >>>>>>>>>
> >>>>>>>>> Why does the implicitly sent request need to be synchronous? The
> >>>>>>>>> KIP also says
> >>>>>>>>>
> >>>>>>>>>> in case we need to abort and need to know which partitions
> >>>>>>>>>
> >>>>>>>>> What do you mean by this?
> >>>>>>>>>
> >>>>>>>>>> we don’t want to write to it before we store in the transaction
> >>>>>>>>>> manager
> >>>>>>>>>
> >>>>>>>>> Do you mean TX-coordinator instead of "manager"?
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> (60)
> >>>>>>>>>
> >>>>>>>>> For older clients and ensuring that the TX is ongoing, you
> >>>>>>>>> describe a race condition. I am not sure if I can follow here.
> >>>>>>>>> Can you elaborate?
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> -Matthias
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> On 11/18/22 1:21 PM, Justine Olshan wrote:
> >>>>>>>>>> Hey all!
> >>>>>>>>>>
> >>>>>>>>>> I'd like to start a discussion on my proposal to add some
> >>>>>>>>>> server-side checks on transactions to avoid hanging
> >>>>>>>>>> transactions. I know this has been an issue for some time, so I
> >>>>>>>>>> really hope this KIP will be helpful for many users of EOS.
> >>>>>>>>>>
> >>>>>>>>>> The KIP includes changes that will be compatible with old
> >>>>>>>>>> clients and changes to improve performance and correctness on
> >>>>>>>>>> new clients.
> >>>>>>>>>>
> >>>>>>>>>> Please take a look and leave any comments you may have!
> >>>>>>>>>>
> >>>>>>>>>> KIP:
> >>>>>>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-890%3A+Transactions+Server-Side+Defense
> >>>>>>>>>> JIRA: https://issues.apache.org/jira/browse/KAFKA-14402
> >>>>>>>>>>
> >>>>>>>>>> Thanks!
> >>>>>>>>>> Justine