[
https://issues.apache.org/jira/browse/KAFKA-13375?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Guozhang Wang updated KAFKA-13375:
----------------------------------
Component/s: (was: streams)
> Kafka streams apps w/EOS unable to start at InitProducerId
> ----------------------------------------------------------
>
> Key: KAFKA-13375
> URL: https://issues.apache.org/jira/browse/KAFKA-13375
> Project: Kafka
> Issue Type: Bug
> Components: core
> Affects Versions: 2.8.0
> Reporter: Lerh Chuan Low
> Priority: Major
>
> Hello, I'm wondering if this is a Kafka bug. Our environment setup is as
> follows:
> Kafka Streams 2.8 - with *EXACTLY_ONCE* turned on (not *EOS_BETA*, but I
> don't think the changes introduced in EOS beta should affect this).
> Transaction timeout = 60s.
> Kafka broker 2.8.
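> For reference, the EOS-relevant part of our Streams configuration boils down
> to roughly the following (a sketch with a placeholder bootstrap server; the
> property names are the standard Streams/producer ones, everything else is
> omitted):
> {code:java}
> Properties props = new Properties();
> props.put(StreamsConfig.APPLICATION_ID_CONFIG, "srn-rec-feeder");
> props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "broker-1:9092"); // placeholder
> // EOS (eos-v1), i.e. processing.guarantee=exactly_once
> props.put(StreamsConfig.PROCESSING_GUARANTEE_CONFIG, StreamsConfig.EXACTLY_ONCE);
> // transaction timeout = 60s, passed through to the Streams-managed producers
> props.put(StreamsConfig.producerPrefix(ProducerConfig.TRANSACTION_TIMEOUT_CONFIG), 60_000);
> {code}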
> We ran into this situation while doing a rolling restart of the brokers to
> apply some security changes. After we finished, 4 out of some 15 Streams apps
> are unable to start. They never succeed, no matter what we do.
> They fail with the error:
> {code:java}
> 2021-10-14 07:20:13,548 WARN
> [srn-rec-feeder-802c18a1-9512-4a2a-8c2e-00e37550199d-StreamThread-3]
> o.a.k.s.p.i.StreamsProducer stream-thread
> [srn-rec-feeder-802c18a1-9512-4a2a-8c2e-00e37550199d-StreamThread-3] task
> [0_6] Timeout exception caught trying to initialize transactions. The broker
> is either slow or in bad state (like not having enough replicas) in
> responding to the request, or the connection to broker was interrupted
> sending the request or receiving the response. Will retry initializing the
> task in the next loop. Consider overwriting max.block.ms to a larger value to
> avoid timeout errors{code}
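> (As an aside: overriding max.block.ms as the message suggests would, if I
> understand the Streams config correctly, go through the producer prefix, e.g.
> the line below added to the config above. The value is only illustrative, and
> since the requests never succeed for us it would presumably just delay the
> timeout.)
> {code:java}
> // Illustrative only: give InitProducerId more time before the client-side timeout.
> props.put(StreamsConfig.producerPrefix(ProducerConfig.MAX_BLOCK_MS_CONFIG), 300_000);
> {code}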
> We found a previous Jira describing the issue here:
> https://issues.apache.org/jira/browse/KAFKA-8803. It seems that back then the
> fix was to rolling-restart the brokers. We tried that - we targeted the group
> coordinators for our failing apps, then the transaction coordinators, then all
> of them. It hasn't resolved our issue so far.
> A few interesting things we've found so far:
> - What I can see is that all the failing apps only fail on certain
> partitions. E.g. for the app above, only partition 6 never succeeds. Partition
> 6 shares the same coordinator as some of the other partitions and those work,
> so it seems like the issue isn't related to broker memory state.
> - All the failing apps have a message similar to this
> {code:java}
> [2021-10-14 00:54:51,569] INFO [Transaction Marker Request Completion Handler
> 103]: Sending srn-rec-feeder-0_6's transaction marker for partition
> srn-bot-003-14 has permanently failed with error
> org.apache.kafka.common.errors.InvalidProducerEpochException with the current
> coordinator epoch 143; cancel sending any more transaction markers
> TxnMarkerEntry{producerId=7001, producerEpoch=610, coordinatorEpoch=143,
> result=ABORT, partitions=[srn-bot-003-14]} to the brokers
> (kafka.coordinator.transaction.TransactionMarkerRequestCompletionHandler)
> {code}
> This was logged while we were restarting the brokers; they all failed shortly
> after. No other consumer groups for the other working partitions/working
> stream apps logged this message.
> On digging around in git blame and reading through the source, it looks like
> this is meant to be benign.
> - We tried DEBUG logging for the TransactionCoordinator and
> TransactionStateManager. We can see (assigner is a functioning app)
> {code:java}
> [2021-10-14 06:48:23,813] DEBUG [TransactionCoordinator id=105] Returning
> CONCURRENT_TRANSACTIONS error code to client for srn-assigner-0_14's
> AddPartitions request (kafka.coordinator.transaction.TransactionCoordinator)
> {code}
> I've seen those before during steady state and I do believe they are benign.
> We never see them for the problematic partitions/consumer groups for some reason.
> - We tried turning on TRACE for KafkaApis. We can see
> {code:java}
> [2021-10-14 06:56:58,408] TRACE [KafkaApi-105] Completed srn-rec-feeder-0_6's
> InitProducerIdRequest with result
> InitProducerIdResult(-1,-1,CONCURRENT_TRANSACTIONS) from client
> srn-rec-feeder-802c18a1-9512-4a2a-8c2e-00e37550199d-StreamThread-4-0_6-producer.
> (kafka.server.KafkaApis) {code}
> It starts to make me wonder if there's a situation where Kafka is unable to
> abort the transactions if a producer ID is never successfully initialized. But
> this is diving deep into insider-knowledge territory that I simply don't have.
> I'm wondering if anyone with more knowledge of how transactions work can shed
> some light here on whether we are on the wrong path, or if there's any way to
> restore operations at all short of a streams reset? From a
> cursory look at
> {noformat}
> TransactionCoordinator#handleInitProducerId {noformat}
> it looks like any old transactions should just be aborted and life goes on,
> but it's not happening.
> I wonder if there's some corrupted state in Kafka's transaction log
> (__transaction_state) files. I'm not sure if it's possible to edit them, or
> how to resolve the issue if that's true.
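> One way we are thinking of narrowing this down (a sketch, assuming eos-v1
> transactional ids are of the form <application.id>-<taskId>, i.e.
> srn-rec-feeder-0_6 for the stuck task, and a placeholder bootstrap server):
> run a bare transactional producer with that id and see whether
> initTransactions() alone can get past InitProducerId outside of Streams. Note
> that running this would fence any live producer using the same transactional
> id.
> {code:java}
> import java.util.Properties;
>
> import org.apache.kafka.clients.producer.KafkaProducer;
> import org.apache.kafka.clients.producer.ProducerConfig;
> import org.apache.kafka.common.serialization.ByteArraySerializer;
>
> public class InitPidProbe {
>     public static void main(String[] args) {
>         Properties props = new Properties();
>         props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker-1:9092"); // placeholder
>         props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, ByteArraySerializer.class.getName());
>         props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, ByteArraySerializer.class.getName());
>         // Same transactional.id Streams (eos-v1) would use for the stuck task 0_6.
>         props.put(ProducerConfig.TRANSACTIONAL_ID_CONFIG, "srn-rec-feeder-0_6");
>         props.put(ProducerConfig.MAX_BLOCK_MS_CONFIG, 60_000);
>         try (KafkaProducer<byte[], byte[]> producer = new KafkaProducer<>(props)) {
>             // InitProducerId is sent here; if the coordinator keeps returning
>             // CONCURRENT_TRANSACTIONS this times out just like the StreamThread does.
>             producer.initTransactions();
>             System.out.println("initTransactions() succeeded");
>         }
>     }
> }
> {code}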
> Update:
> We temporarily managed to get our producers/consumers working again by
> removing EOS and making them ALO (At least once).
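> Concretely that was just flipping the processing guarantee (roughly the line
> below), leaving everything else unchanged:
> {code:java}
> // Back to at-least-once; all other Streams config left as before.
> props.put(StreamsConfig.PROCESSING_GUARANTEE_CONFIG, StreamsConfig.AT_LEAST_ONCE);
> {code}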
> After we had left it running for a bit (> 15 minutes), we decided to turn on
> EOS again to see if the problem had gone away. It still failed with the exact
> same error.
> So we've turned it back to ALO. Not sure if that helps or gives a clue into
> what may have gone awry.
> Thanks!
--
This message was sent by Atlassian Jira
(v8.3.4#803005)