Hi Colt,

You can certainly review PRs, but you cannot merge them. Reviews are still valuable as they help lighten the workload on committers and, to some degree, provide a signal of how high-priority a change is to the community.
Cheers,
Chris

On Sun, Sep 11, 2022 at 12:19 PM Colt McNealy <c...@littlehorse.io> wrote:

> Hi all—
>
> I'm not a committer, so I can't review this PR (or is that not true?). However, I'd like to bump this as well. I believe that I've encountered this bug during chaos testing with the transactional producer. I can sometimes reproduce this error by killing a broker during a long-running transaction, which causes a batch to encounter a delivery timeout as described in the Jira. I have observed some inconsistencies with the consumer offset being advanced prematurely (i.e. perhaps after the delivery of the EndTxnRequest).
>
> Daniel, thank you for the PR.
>
> Cheers,
> Colt McNealy
> *Founder, LittleHorse.io*
>
> On Fri, Sep 9, 2022 at 9:54 AM Dániel Urbán <urb.dani...@gmail.com> wrote:
>
> > Hi all,
> >
> > I would like to bump this and bring some attention to the issue. This is a nasty bug in the transactional producer; it would be nice if I could get some feedback on the PR: https://github.com/apache/kafka/pull/12392
> >
> > Thanks in advance,
> > Daniel
> >
> > On Mon, Jul 25, 2022 at 3:28 PM Viktor Somogyi-Vass <viktor.somo...@cloudera.com.invalid> wrote:
> >
> > > Hi Luke & Artem,
> > >
> > > We prepared the fix; would you please help in finding a committer-reviewer to get this issue resolved?
> > >
> > > Thanks,
> > > Viktor
> > >
> > > On Fri, Jul 8, 2022 at 12:57 PM Dániel Urbán <urb.dani...@gmail.com> wrote:
> > >
> > > > Submitted a PR with the fix: https://github.com/apache/kafka/pull/12392
> > > > In the PR I tried keeping the producer in a usable state after the forced bump. I understand that it might not be the cleanest solution, but the only other option I know of is to transition into a fatal state, meaning that the producer has to be recreated after a delivery timeout. I think that is still fine compared to the out-of-order messages.
> > > >
> > > > Looking forward to your reviews,
> > > > Daniel
> > > >
> > > > On Thu, Jul 7, 2022 at 12:04 PM Dániel Urbán <urb.dani...@gmail.com> wrote:
> > > >
> > > > > Thanks for the feedback, I created https://issues.apache.org/jira/browse/KAFKA-14053 and started working on a PR.
> > > > >
> > > > > Luke, for the workaround, we used the transaction admin tool released in 3.0 to "abort" these hanging batches manually.
> > > > > Naturally, the cluster health also needs to be stabilized. This issue popped up most frequently around times when some partitions went into a few-minute window of unavailability. The infinite retries on the producer side caused a situation where the last retry was still in flight, but the delivery timeout had already been triggered on the client side. We reduced the retries and increased the delivery timeout to avoid such situations.
> > > > > Still, the issue can occur in other scenarios, for example a client queueing up many batches in the producer buffer, causing those batches to spend most of the delivery timeout window in client memory.
> > > > >
> > > > > Thanks,
> > > > > Daniel
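[For readers facing the same situation, a minimal sketch of what the workaround described above could look like in code. The topic, partition, producer id/epoch, and coordinator epoch are hypothetical values that would normally be discovered with Admin#describeProducers or the kafka-transactions.sh tool; the abort call assumes the Admin API added in Kafka 3.0 (KIP-664), and the producer settings only illustrate the "fewer retries, larger delivery timeout" tuning mentioned above.]

    import java.time.Duration;
    import java.util.Properties;
    import org.apache.kafka.clients.admin.AbortTransactionSpec;
    import org.apache.kafka.clients.admin.Admin;
    import org.apache.kafka.clients.producer.ProducerConfig;
    import org.apache.kafka.common.TopicPartition;

    public class HangingTxnWorkaround {
        public static void main(String[] args) throws Exception {
            Properties adminProps = new Properties();
            adminProps.put("bootstrap.servers", "broker:9092");

            // 1) Manually abort the hanging transaction that is blocking the LSO of a
            //    stuck partition. The ids below are made up; in practice they come from
            //    Admin#describeProducers or the kafka-transactions.sh tool.
            try (Admin admin = Admin.create(adminProps)) {
                admin.abortTransaction(new AbortTransactionSpec(
                        new TopicPartition("payments", 3), // stuck partition
                        4521L,                             // producerId
                        (short) 7,                         // producerEpoch
                        12))                               // coordinatorEpoch
                     .all()
                     .get();
            }

            // 2) Producer-side tuning: fewer retries and a larger delivery timeout make it
            //    less likely that a retry is still in flight when the delivery timeout fires.
            Properties producerProps = new Properties();
            producerProps.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092");
            producerProps.put(ProducerConfig.RETRIES_CONFIG, 5);
            producerProps.put(ProducerConfig.DELIVERY_TIMEOUT_MS_CONFIG, (int) Duration.ofMinutes(5).toMillis());
            producerProps.put(ProducerConfig.TRANSACTIONAL_ID_CONFIG, "my-transactional-id");
        }
    }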
> > > > > On Thu, Jul 7, 2022 at 5:13 AM Luke Chen <show...@gmail.com> wrote:
> > > > >
> > > > >> Hi Daniel,
> > > > >>
> > > > >> Thanks for reporting the issue, and the investigation.
> > > > >> I'm curious, so, what's your workaround for this issue?
> > > > >>
> > > > >> I agree with Artem, it makes sense. Please file a bug in JIRA.
> > > > >> And looking forward to your PR! :)
> > > > >>
> > > > >> Thank you.
> > > > >> Luke
> > > > >>
> > > > >> On Thu, Jul 7, 2022 at 3:07 AM Artem Livshits <alivsh...@confluent.io.invalid> wrote:
> > > > >>
> > > > >> > Hi Daniel,
> > > > >> >
> > > > >> > What you say makes sense. Could you file a bug and put this info there so that it's easier to track?
> > > > >> >
> > > > >> > -Artem
> > > > >> >
> > > > >> > On Wed, Jul 6, 2022 at 8:34 AM Dániel Urbán <urb.dani...@gmail.com> wrote:
> > > > >> >
> > > > >> > > Hello everyone,
> > > > >> > >
> > > > >> > > I've been investigating some transaction-related issues in a very problematic cluster. Besides finding some interesting issues, I had some ideas about how the transactional producer's behavior could be improved.
> > > > >> > >
> > > > >> > > My suggestion in short is: when the transactional producer encounters an error which doesn't necessarily mean that the in-flight request was processed (for example a client-side timeout), the producer should not send an EndTxnRequest on abort, but should instead bump the producer epoch.
> > > > >> > >
> > > > >> > > The long description of the issue I found, and how I came to the suggestion:
> > > > >> > >
> > > > >> > > First, the description of the issue. When I say that the cluster is "very problematic", I mean all kinds of different issues, be it infra (disks and network) or throughput (high-volume producers without fine tuning).
> > > > >> > > In this cluster, Kafka transactions are widely used by many producers. And in this cluster, partitions get "stuck" frequently (a few times every week).
> > > > >> > >
> > > > >> > > The exact meaning of a partition being "stuck" is this:
> > > > >> > >
> > > > >> > > On the client side:
> > > > >> > > 1. A transactional producer sends X batches to a partition in a single transaction
> > > > >> > > 2. Out of the X batches, the last few get sent, but are timed out due to the delivery timeout config
> > > > >> > > 3. producer.flush() is unblocked due to all batches being "finished"
> > > > >> > > 4. Based on the errors reported in the producer.send() callback, producer.abortTransaction() is called
> > > > >> > > 5. Then producer.close() is also invoked with a 5s timeout (this application does not reuse the producer instances optimally)
> > > > >> > > 6. The transactional.id of the producer is never reused (it was randomly generated)
> > > > >> > >
> > > > >> > > On the partition leader side (what appears in the log segment of the partition):
> > > > >> > > 1. The batches sent by the producer are all appended to the log
> > > > >> > > 2. But the ABORT marker of the transaction was appended before the last 1 or 2 batches of the transaction
> > > > >> > >
> > > > >> > > On the transaction coordinator side (what appears in the transaction state partition):
> > > > >> > > The transactional.id is present with the Empty state.
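[Purely for illustration, the client-side sequence in the six steps above roughly corresponds to a produce loop like the following. The topic name, record contents, and error handling are hypothetical; this is not the affected application's code, only the shape of the interaction.]

    import java.time.Duration;
    import java.util.Properties;
    import java.util.UUID;
    import java.util.concurrent.atomic.AtomicBoolean;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerConfig;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.apache.kafka.common.errors.TimeoutException;
    import org.apache.kafka.common.serialization.StringSerializer;

    public class StuckPartitionScenario {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092");
            props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
            props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
            // Step 6: the transactional.id is randomly generated and never reused.
            props.put(ProducerConfig.TRANSACTIONAL_ID_CONFIG, "tx-" + UUID.randomUUID());

            AtomicBoolean sawError = new AtomicBoolean(false);
            KafkaProducer<String, String> producer = new KafkaProducer<>(props);
            producer.initTransactions();
            producer.beginTransaction();

            // Step 1: send X batches to the same partition in a single transaction.
            for (int i = 0; i < 1000; i++) {
                producer.send(new ProducerRecord<>("events", "same-key", "value-" + i),
                        (metadata, exception) -> {
                            // Step 2: the last few batches may fail with a client-side
                            // delivery timeout instead of a definitive broker response.
                            if (exception instanceof TimeoutException) {
                                sawError.set(true);
                            }
                        });
            }

            // Step 3: flush() unblocks once every batch is "finished",
            // including batches that merely expired on the client.
            producer.flush();

            // Step 4: errors reported in the callbacks lead to an abort.
            if (sawError.get()) {
                producer.abortTransaction();
            } else {
                producer.commitTransaction();
            }

            // Step 5: the producer is closed with a 5s timeout.
            producer.close(Duration.ofSeconds(5));
        }
    }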
> > > > >> > > These happenings result in the following:
> > > > >> > > 1. The partition leader handles the first batch after the ABORT marker as the first message of a new transaction of the same producer id + epoch. (The LSO is blocked at this point.)
> > > > >> > > 2. The transaction coordinator is not aware of any in-progress transaction of the producer, and thus never aborts the transaction, not even after transaction.timeout.ms passes.
> > > > >> > >
> > > > >> > > This is happening with Kafka 2.5 running in the cluster; producer versions range between 2.0 and 2.6.
> > > > >> > > I scanned through a lot of tickets, and I believe that this issue is not specific to these versions and could happen with the newest versions as well. If I'm mistaken, some pointers would be appreciated.
> > > > >> > >
> > > > >> > > Assuming that the issue could occur with any version, I believe this issue boils down to one oversight on the client side:
> > > > >> > > When a request fails without a definitive response (e.g. a delivery timeout), the client cannot assume that the request is "finished" and simply abort the transaction. If the request is still in flight, and the EndTxnRequest and then the WriteTxnMarkerRequest get sent and processed earlier, the contract is violated by the client.
> > > > >> > > This could be avoided by providing more information to the partition leader. Right now, a new transactional batch signals the start of a new transaction, and there is no way for the partition leader to decide whether the batch is an out-of-order message.
> > > > >> > > In a naive and wasteful protocol, we could have a unique transaction id added to each batch and marker, meaning that the leader would be capable of refusing batches which arrive after the control marker of the transaction. But instead of changing the log format and the protocol, we can achieve the same by bumping the producer epoch.
> > > > >> > >
> > > > >> > > Bumping the epoch has a similar effect to "changing the transaction id": the in-progress transaction will be aborted with a bumped producer epoch, telling the partition leader about the producer epoch change. From this point on, any batches sent with the old epoch will be refused by the leader due to the fencing mechanism. It doesn't really matter how many batches get appended to the log and how many are refused - this is an aborted transaction - but the out-of-order message cannot occur, and it cannot block the LSO infinitely.
> > > > >> > >
> > > > >> > > My suggestion is that the TransactionManager inside the producer should keep track of what type of errors were encountered by the batches of the transaction, and categorize them along the lines of "definitely completed" and "might not be completed". When the transaction goes into an abortable state, and there is at least one batch with "might not be completed", the EndTxnRequest should be skipped, and an epoch bump should be sent instead.
> > > > >> > > As for what type of error counts as "might not be completed", I can only think of client-side timeouts.
> > > > >> > >
> > > > >> > > I believe this is a relatively small change (it only affects the client lib), but it helps in avoiding some corrupt states in Kafka transactions.
> > > > >> > >
> > > > >> > > Looking forward to your input. If it seems like a sane idea, I'll go ahead and submit a PR for it as well.
> > > > >> > >
> > > > >> > > Thanks in advance,
> > > > >> > > Daniel
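[A rough sketch of the categorization Daniel proposes, using hypothetical names; the real TransactionManager inside the Java client is considerably more involved, and this only illustrates the "skip the EndTxnRequest, bump the epoch instead" decision for batches whose fate is unknown.]

    import java.util.List;
    import org.apache.kafka.common.errors.TimeoutException;

    // Illustrative sketch only; names and structure are hypothetical and do not mirror
    // the actual TransactionManager in the Kafka producer internals.
    final class AbortDecision {

        enum BatchOutcome {
            DEFINITELY_COMPLETED,   // the broker returned a definitive success or error
            MIGHT_NOT_BE_COMPLETED  // e.g. a client-side delivery timeout: the request may still be in flight
        }

        // Classify how a batch finished, based on the exception (if any) seen in its callback.
        static BatchOutcome classify(Exception batchException) {
            if (batchException instanceof TimeoutException) {
                // The broker may still process this batch after the client has given up on it.
                return BatchOutcome.MIGHT_NOT_BE_COMPLETED;
            }
            return BatchOutcome.DEFINITELY_COMPLETED;
        }

        // On abort: if any batch of the transaction might still be in flight, do not send an
        // EndTxnRequest; bump the producer epoch instead, so the old epoch is fenced on the leader.
        static boolean shouldBumpEpochInsteadOfEndTxn(List<BatchOutcome> outcomes) {
            return outcomes.contains(BatchOutcome.MIGHT_NOT_BE_COMPLETED);
        }
    }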