Submitted a PR with the fix: https://github.com/apache/kafka/pull/12392

In the PR I tried keeping the producer in a usable state after the forced bump. I understand that it might not be the cleanest solution, but the only other option I know of is to transition into a fatal state, meaning that the producer has to be recreated after a delivery timeout. I think even that would still be fine compared to the out-of-order messages.
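For reference, here is a minimal sketch of what the two options mean for application code, modeled on the standard transactional producer usage pattern (the bootstrap address, topic, and transactional.id are placeholders, and the retry logic is omitted):

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.KafkaException;
import org.apache.kafka.common.errors.AuthorizationException;
import org.apache.kafka.common.errors.OutOfOrderSequenceException;
import org.apache.kafka.common.errors.ProducerFencedException;

public class TransactionalLoopExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // placeholder address
        props.put("transactional.id", "example-tx-id");     // placeholder id
        props.put("key.serializer",
            "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
            "org.apache.kafka.common.serialization.StringSerializer");

        KafkaProducer<String, String> producer = new KafkaProducer<>(props);
        producer.initTransactions();
        try {
            producer.beginTransaction();
            for (int i = 0; i < 100; i++) {
                producer.send(new ProducerRecord<>("example-topic", "k" + i, "v" + i));
            }
            producer.commitTransaction();
        } catch (ProducerFencedException | OutOfOrderSequenceException | AuthorizationException e) {
            // Fatal state: this instance cannot be used anymore, the application
            // has to create a new producer to continue.
        } catch (KafkaException e) {
            // Abortable error: with the PR's behavior, a transaction tainted by a
            // delivery timeout stays in this category - abortTransaction() performs
            // the forced epoch bump, and the same instance can begin a new
            // transaction afterwards.
            producer.abortTransaction();
        }
        producer.close();
    }
}

The point of keeping the producer usable is that the delivery-timeout case stays in the second branch instead of being pushed into the first.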
Looking forward to your reviews,
Daniel

Dániel Urbán <urb.dani...@gmail.com> wrote (on Thu, 7 Jul 2022, 12:04):

Thanks for the feedback, I created https://issues.apache.org/jira/browse/KAFKA-14053 and started working on a PR.

Luke, for the workaround, we used the transaction admin tool released in 3.0 to "abort" these hanging transactions manually.
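A rough sketch of that manual abort, assuming the Admin API added by KIP-664 in 3.0 (Admin#describeProducers and Admin#abortTransaction); the bootstrap address, topic, and partition are placeholders, and error handling is omitted:

import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.AbortTransactionSpec;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.ProducerState;
import org.apache.kafka.common.TopicPartition;

public class AbortHangingTransactionExample {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");          // placeholder address
        TopicPartition tp = new TopicPartition("stuck-topic", 0);  // placeholder partition

        try (Admin admin = Admin.create(props)) {
            // Look for producer states that still have an open transaction on the
            // partition (currentTransactionStartOffset is present).
            for (ProducerState state : admin.describeProducers(Collections.singleton(tp))
                    .partitionResult(tp).get().activeProducers()) {
                if (state.currentTransactionStartOffset().isPresent()) {
                    // Forcefully write an ABORT marker for that producer id + epoch.
                    admin.abortTransaction(new AbortTransactionSpec(
                            tp,
                            state.producerId(),
                            (short) state.producerEpoch(),
                            state.coordinatorEpoch().orElse(0)))
                        .all().get();
                }
            }
        }
    }
}

The same operations should also be available through the kafka-transactions.sh command line tool.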
Naturally, the cluster health should be stabilized first. This issue popped up most frequently around times when some partitions went into a few-minute window of unavailability. The infinite retries on the producer side caused a situation where the last retry was still in flight, but the delivery timeout had already been triggered on the client side. We reduced the retries and increased the delivery timeout to avoid such situations.
Still, the issue can occur in other scenarios, for example a client queueing up many batches in the producer buffer, causing those batches to spend most of the delivery timeout window in client memory.

Thanks,
Daniel

Luke Chen <show...@gmail.com> wrote (on Thu, 7 Jul 2022, 5:13):

Hi Daniel,

Thanks for reporting the issue, and the investigation.
I'm curious, so, what's your workaround for this issue?

I agree with Artem, it makes sense. Please file a bug in JIRA.
And looking forward to your PR! :)

Thank you.
Luke

On Thu, Jul 7, 2022 at 3:07 AM Artem Livshits <alivsh...@confluent.io.invalid> wrote:

Hi Daniel,

What you say makes sense. Could you file a bug and put this info there so that it's easier to track?

-Artem

On Wed, Jul 6, 2022 at 8:34 AM Dániel Urbán <urb.dani...@gmail.com> wrote:

Hello everyone,

I've been investigating some transaction-related issues in a very problematic cluster. Besides finding some interesting issues, I had some ideas about how transactional producer behavior could be improved.

My suggestion in short: when the transactional producer encounters an error which doesn't necessarily mean that the in-flight request was processed (for example a client-side timeout), the producer should not send an EndTxnRequest on abort, but should instead bump the producer epoch.

The long description of the issue I found, and how I came to the suggestion:

First, the description of the issue. When I say that the cluster is "very problematic", I mean all kinds of different issues, be it infra (disks and network) or throughput (high-volume producers without fine-tuning).
In this cluster, Kafka transactions are widely used by many producers, and partitions get "stuck" frequently (a few times every week).

The exact meaning of a partition being "stuck" is this:

On the client side (a rough code sketch of this sequence follows the list):
1. A transactional producer sends X batches to a partition in a single transaction.
2. Out of the X batches, the last few get sent, but are timed out due to the delivery timeout config.
3. producer.flush() is unblocked because all batches are "finished".
4. Based on the errors reported in the producer.send() callback, producer.abortTransaction() is called.
5. Then producer.close() is also invoked with a 5s timeout (this application does not reuse producer instances optimally).
6. The transactional.id of the producer is never reused (it was randomly generated).
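The sketch below is a simplified version of that client-side sequence (the topic, record count, and configs are made up for the example):

import java.time.Duration;
import java.util.Properties;
import java.util.UUID;
import java.util.concurrent.atomic.AtomicBoolean;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class StuckPartitionClientFlow {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");             // placeholder address
        props.put("transactional.id", UUID.randomUUID().toString());  // randomly generated, never reused
        props.put("delivery.timeout.ms", "120000");
        props.put("key.serializer",
            "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
            "org.apache.kafka.common.serialization.StringSerializer");

        KafkaProducer<String, String> producer = new KafkaProducer<>(props);
        producer.initTransactions();
        producer.beginTransaction();

        AtomicBoolean deliveryFailed = new AtomicBoolean(false);
        // 1. Send X batches to a single partition in one transaction.
        for (int i = 0; i < 1000; i++) {
            producer.send(new ProducerRecord<>("example-topic", 0, "k" + i, "v" + i),
                (metadata, exception) -> {
                    // 2. The last few batches hit the delivery timeout, even though the
                    //    corresponding produce requests may still be in flight.
                    if (exception != null) {
                        deliveryFailed.set(true);
                    }
                });
        }

        // 3. flush() unblocks because every batch is "finished" (success or timeout).
        producer.flush();

        if (deliveryFailed.get()) {
            // 4. Abort based on the callback errors.
            producer.abortTransaction();
        } else {
            producer.commitTransaction();
        }

        // 5. Close with a 5s timeout; neither the instance nor the transactional.id is reused.
        producer.close(Duration.ofSeconds(5));
    }
}

If the abort markers are written while a timed-out produce request is still in flight, the late batch lands after the ABORT marker, which is exactly the broker-side picture described next.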
On the partition leader side (what appears in the log segment of the partition):
1. The batches sent by the producer are all appended to the log.
2. But the ABORT marker of the transaction was appended before the last 1 or 2 batches of the transaction.

On the transaction coordinator side (what appears in the transaction state partition):
The transactional.id is present with the Empty state.

This results in the following:
1. The partition leader handles the first batch after the ABORT marker as the first message of a new transaction of the same producer id + epoch. (The LSO is blocked at this point.)
2. The transaction coordinator is not aware of any in-progress transaction of the producer, thus it never aborts the transaction, not even after transaction.timeout.ms passes.

This is happening with Kafka 2.5 running in the cluster; producer versions range between 2.0 and 2.6.
I scanned through a lot of tickets, and I believe that this issue is not specific to these versions and could happen with the newest versions as well. If I'm mistaken, some pointers would be appreciated.

Assuming that the issue could occur with any version, I believe it boils down to one oversight on the client side: when a request fails without a definitive response (e.g. a delivery timeout), the client cannot assume that the request is "finished" and simply abort the transaction. If the request is still in flight, but the EndTxnRequest and then the WriteTxnMarkerRequest get sent and processed earlier, the contract is violated by the client.
This could be avoided by providing more information to the partition leader. Right now, a new transactional batch signals the start of a new transaction, and there is no way for the partition leader to decide whether the batch is an out-of-order message.
In a naive and wasteful protocol, we could add a unique transaction id to each batch and marker, meaning that the leader would be capable of refusing batches which arrive after the control marker of the transaction. But instead of changing the log format and the protocol, we can achieve the same by bumping the producer epoch.

Bumping the epoch has a similar effect to "changing the transaction id": the in-progress transaction will be aborted with a bumped producer epoch, telling the partition leader about the epoch change. From this point on, any batches sent with the old epoch will be refused by the leader due to the fencing mechanism. It doesn't really matter how many batches get appended to the log and how many get refused - this is an aborted transaction - but the out-of-order message cannot occur, and cannot block the LSO indefinitely.

My suggestion is that the TransactionManager inside the producer should keep track of what type of errors were encountered by the batches of the transaction, and categorize them along the lines of "definitely completed" and "might not be completed". When the transaction goes into an abortable state and there is at least one batch that "might not be completed", the EndTxnRequest should be skipped, and an epoch bump should be sent instead. As for what type of error counts as "might not be completed", I can only think of client-side timeouts. (A rough illustrative sketch of this categorization follows below.)

I believe this is a relatively small change (it only affects the client lib), but it helps in avoiding some corrupt states in Kafka transactions.

Looking forward to your input. If it seems like a sane idea, I will go ahead and submit a PR for it as well.

Thanks in advance,
Daniel
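To make the proposed categorization concrete, here is a very simplified, illustrative sketch; this is not the actual TransactionManager code, and the class and method names are made up:

import java.util.ArrayList;
import java.util.List;
import org.apache.kafka.common.errors.TimeoutException;

// Illustrative only: a simplified model of the proposed bookkeeping, not the
// real producer internals.
class TransactionOutcomeTracker {

    enum BatchOutcome {
        DEFINITELY_COMPLETED,   // the broker returned a definitive response
        MIGHT_NOT_BE_COMPLETED  // e.g. a client-side delivery timeout; the request may still be in flight
    }

    private final List<BatchOutcome> outcomes = new ArrayList<>();

    // Called once per batch when its final send result is known.
    void recordBatchResult(Exception sendException) {
        if (sendException instanceof TimeoutException) {
            // A client-side timeout means no definitive broker response was seen.
            outcomes.add(BatchOutcome.MIGHT_NOT_BE_COMPLETED);
        } else {
            outcomes.add(BatchOutcome.DEFINITELY_COMPLETED);
        }
    }

    // On abort: if every batch definitely completed, sending an EndTxnRequest is
    // safe; otherwise skip it and bump the producer epoch instead, so partition
    // leaders fence any late-arriving batches of the old epoch.
    boolean shouldBumpEpochInsteadOfEndTxn() {
        return outcomes.contains(BatchOutcome.MIGHT_NOT_BE_COMPLETED);
    }
}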