[
https://issues.apache.org/jira/browse/KAFKA-13375?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Guozhang Wang updated KAFKA-13375:
----------------------------------
Component/s: (was: streams)
> Kafka streams apps w/EOS unable to start at InitProducerId
> ----------------------------------------------------------
>
> Key: KAFKA-13375
> URL: https://issues.apache.org/jira/browse/KAFKA-13375
> Project: Kafka
> Issue Type: Bug
> Components: core
> Affects Versions: 2.8.0
> Reporter: Lerh Chuan Low
> Priority: Major
>
> Hello, I'm wondering if this is a Kafka bug. Our environment setup is as
> follows:
> Kafka Streams 2.8 - with *EXACTLY_ONCE* turned on (not *EOS_BETA*, but I
> don't think the changes introduced in EOS beta should affect this).
> Transaction timeout = 60s.
> Kafka broker 2.8.
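> For reference, the EOS-relevant part of our Streams configuration boils down
> to roughly the following (a sketch with a placeholder bootstrap server; the
> property names are the standard Streams/producer ones, everything else is
> omitted):
> {code:java}
> Properties props = new Properties();
> props.put(StreamsConfig.APPLICATION_ID_CONFIG, "srn-rec-feeder");
> props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "broker-1:9092"); // placeholder
> // EOS (eos-v1), i.e. processing.guarantee=exactly_once
> props.put(StreamsConfig.PROCESSING_GUARANTEE_CONFIG, StreamsConfig.EXACTLY_ONCE);
> // transaction timeout = 60s, passed through to the Streams-managed producers
> props.put(StreamsConfig.producerPrefix(ProducerConfig.TRANSACTION_TIMEOUT_CONFIG), 60_000);
> {code}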
> We ran into this situation while doing a rolling restart of the brokers to
> apply some security changes. After we finished, 4 out of some 15 Streams apps
> are unable to start. They never succeed, no matter what we do.
> They fail with the error:
> {code:java}
> 2021-10-14 07:20:13,548 WARN
> [srn-rec-feeder-802c18a1-9512-4a2a-8c2e-00e37550199d-StreamThread-3]
> o.a.k.s.p.i.StreamsProducer stream-thread
> [srn-rec-feeder-802c18a1-9512-4a2a-8c2e-00e37550199d-StreamThread-3] task
> [0_6] Timeout exception caught trying to initialize transactions. The broker
> is either slow or in bad state (like not having enough replicas) in
> responding to the request, or the connection to broker was interrupted
> sending the request or receiving the response. Will retry initializing the
> task in the next loop. Consider overwriting max.block.ms to a larger value to
> avoid timeout errors{code}
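> (As an aside: overriding max.block.ms as the message suggests would, if I
> understand the Streams config correctly, go through the producer prefix, e.g.
> the line below added to the config above. The value is only illustrative, and
> since the requests never succeed for us it would presumably just delay the
> timeout.)
> {code:java}
> // Illustrative only: give InitProducerId more time before the client-side timeout.
> props.put(StreamsConfig.producerPrefix(ProducerConfig.MAX_BLOCK_MS_CONFIG), 300_000);
> {code}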
> We found a previous Jira describing the issue here:
> https://issues.apache.org/jira/browse/KAFKA-8803. It seems that back then the
> fix was to rolling-restart the brokers. We tried that - we targeted the group
> coordinators for our failing apps, then the transaction coordinators, then all
> of them. It hasn't resolved our issue so far.
> A few interesting things we've found so far:
> - What I can see is that all the failing apps only fail on certain
> partitions. E.g. for the app above, only partition 6 never succeeds. Partition
> 6 shares the same coordinator as some of the other partitions and those work,
> so it seems like the issue isn't related to broker memory state.
> - All the failing apps have a message similar to this
> {code:java}
> [2021-10-14 00:54:51,569] INFO [Transaction Marker Request Completion Handler
> 103]: Sending srn-rec-feeder-0_6's transaction marker for partition
> srn-bot-003-14 has permanently failed with error
> org.apache.kafka.common.errors.InvalidProducerEpochException with the current
> coordinator epoch 143; cancel sending any more transaction markers
> TxnMarkerEntry{producerId=7001, producerEpoch=610, coordinatorEpoch=143,
> result=ABORT, partitions=[srn-bot-003-14]} to the brokers
> (kafka.coordinator.transaction.TransactionMarkerRequestCompletionHandler)
> {code}
> This was logged while we were restarting the brokers; they all failed shortly
> after. No other consumer groups for the other working partitions/working
> stream apps logged this message.
> On digging around in git blame and reading through the source, it looks like
> this is meant to be benign.
> - We tried DEBUG logging for the TransactionCoordinator and
> TransactionStateManager. We can see (assigner is a functioning app)
> {code:java}
> [2021-10-14 06:48:23,813] DEBUG [TransactionCoordinator id=105] Returning
> CONCURRENT_TRANSACTIONS error code to client for srn-assigner-0_14's
> AddPartitions request (kafka.coordinator.transaction.TransactionCoordinator)
> {code}
> I've seen those before during steady state and I do believe they are benign.
> We never see them for the problematic partitions/consumer groups for some reason.
> - We tried turning on TRACE for KafkaApis. We can see
> {code:java}
> [2021-10-14 06:56:58,408] TRACE [KafkaApi-105] Completed srn-rec-feeder-0_6's
> InitProducerIdRequest with result
> InitProducerIdResult(-1,-1,CONCURRENT_TRANSACTIONS) from client
> srn-rec-feeder-802c18a1-9512-4a2a-8c2e-00e37550199d-StreamThread-4-0_6-producer.
> (kafka.server.KafkaApis) {code}
> It starts to make me wonder if there's a situation where Kafka is unable to
> abort the transactions if a producer ID is never successfully initialized. But
> this is diving deep into insider-knowledge territory that I simply don't have.
> I'm wondering if anyone with more knowledge of how transactions work can shed
> some light here on whether we are on the wrong path, or if there's any way to
> restore operations at all short of a streams reset? From a
> cursory look at
> {noformat}
> TransactionCoordinator#handleInitProducerId {noformat}
> it looks like any old transactions should just be aborted and life goes on,
> but it's not happening.
> I wonder if there's some corrupted state in Kafka's transaction log
> (__transaction_state) files. I'm not sure if it's possible to edit them, or
> how to resolve the issue if that's true.
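> One way we are thinking of narrowing this down (a sketch, assuming eos-v1
> transactional ids are of the form <application.id>-<taskId>, i.e.
> srn-rec-feeder-0_6 for the stuck task, and a placeholder bootstrap server):
> run a bare transactional producer with that id and see whether
> initTransactions() alone can get past InitProducerId outside of Streams. Note
> that running this would fence any live producer using the same transactional
> id.
> {code:java}
> import java.util.Properties;
>
> import org.apache.kafka.clients.producer.KafkaProducer;
> import org.apache.kafka.clients.producer.ProducerConfig;
> import org.apache.kafka.common.serialization.ByteArraySerializer;
>
> public class InitPidProbe {
>     public static void main(String[] args) {
>         Properties props = new Properties();
>         props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker-1:9092"); // placeholder
>         props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, ByteArraySerializer.class.getName());
>         props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, ByteArraySerializer.class.getName());
>         // Same transactional.id Streams (eos-v1) would use for the stuck task 0_6.
>         props.put(ProducerConfig.TRANSACTIONAL_ID_CONFIG, "srn-rec-feeder-0_6");
>         props.put(ProducerConfig.MAX_BLOCK_MS_CONFIG, 60_000);
>         try (KafkaProducer<byte[], byte[]> producer = new KafkaProducer<>(props)) {
>             // InitProducerId is sent here; if the coordinator keeps returning
>             // CONCURRENT_TRANSACTIONS this times out just like the StreamThread does.
>             producer.initTransactions();
>             System.out.println("initTransactions() succeeded");
>         }
>     }
> }
> {code}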
> Update:
> We temporarily managed to get our producers/consumers working again by
> removing EOS and making them ALO (At least once).
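> Concretely that was just flipping the processing guarantee (roughly the line
> below), leaving everything else unchanged:
> {code:java}
> // Back to at-least-once; all other Streams config left as before.
> props.put(StreamsConfig.PROCESSING_GUARANTEE_CONFIG, StreamsConfig.AT_LEAST_ONCE);
> {code}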
> After we had left it running for a bit (> 15 minutes), we decided to turn on
> EOS again to see if the problem had gone away. It still failed with the exact
> same error.
> So we've turned it back to ALO. Not sure if that helps or gives a clue into
> what may have gone awry.
> Thanks!
--
This message was sent by Atlassian Jira
(v8.3.4#803005)