[ 
https://issues.apache.org/jira/browse/KAFKA-4669?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15888126#comment-15888126
 ] 

Federico Giraud commented on KAFKA-4669:
----------------------------------------

We had the same issue with a Kafka 0.8.2.0 producer and a Kafka 0.9 cluster. 
The bug seems to have been triggered by a change in the ISR (a broker fell out 
of sync and cause a rebalance on multiple topics). A producer that was sending 
messages to multiple topics in the cluster reporter the following exception 
multiple times, within a few seconds:

{code}
[2017-02-19 06:02:05,172] ERROR {kafka-producer-network-thread | egress} 
org.apache.kafka.clients.producer.internals.Sender - Uncaught error in kafka 
producer I/O thread:
java.lang.IllegalStateException: Correlation id for response (90177743) does 
not match request (90177741)
        at 
org.apache.kafka.clients.NetworkClient.correlate(NetworkClient.java:356)
        at 
org.apache.kafka.clients.NetworkClient.handleCompletedReceives(NetworkClient.java:296)
        at org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:199)
        at 
org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:191)
        at 
org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:122)
        at java.lang.Thread.run(Thread.java:745)
{code}

There were no producer errors preceding the sequence of IllegalStateExceptions. 
We have numerous other producers running and none of them reported the issue. 
The only solution was to terminate the process and restarting it.

Please let me know if you need any additional information.

> KafkaProducer.flush hangs when NetworkClient.handleCompletedReceives throws 
> exception
> -------------------------------------------------------------------------------------
>
>                 Key: KAFKA-4669
>                 URL: https://issues.apache.org/jira/browse/KAFKA-4669
>             Project: Kafka
>          Issue Type: Bug
>          Components: clients
>    Affects Versions: 0.9.0.1
>            Reporter: Cheng Ju
>            Priority: Critical
>              Labels: reliability
>             Fix For: 0.10.3.0
>
>
> There is no try catch in NetworkClient.handleCompletedReceives.  If an 
> exception is thrown after inFlightRequests.completeNext(source), then the 
> corresponding RecordBatch's done will never get called, and 
> KafkaProducer.flush will hang on this RecordBatch.
> I've checked 0.10 code and think this bug does exist in 0.10 versions.
> A real case.  First a correlateId not match exception happens:
> 13 Jan 2017 17:08:24,059 ERROR [kafka-producer-network-thread | producer-21] 
> (org.apache.kafka.clients.producer.internals.Sender.run:130)  - Uncaught 
> error in kafka producer I/O thread: 
> java.lang.IllegalStateException: Correlation id for response (703766) does 
> not match request (703764)
>       at 
> org.apache.kafka.clients.NetworkClient.correlate(NetworkClient.java:477)
>       at 
> org.apache.kafka.clients.NetworkClient.handleCompletedReceives(NetworkClient.java:440)
>       at org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:265)
>       at 
> org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:216)
>       at 
> org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:128)
>       at java.lang.Thread.run(Thread.java:745)
> Then jstack shows the thread is hanging on:
>       at java.util.concurrent.CountDownLatch.await(CountDownLatch.java:231)
>       at 
> org.apache.kafka.clients.producer.internals.ProduceRequestResult.await(ProduceRequestResult.java:57)
>       at 
> org.apache.kafka.clients.producer.internals.RecordAccumulator.awaitFlushCompletion(RecordAccumulator.java:425)
>       at 
> org.apache.kafka.clients.producer.KafkaProducer.flush(KafkaProducer.java:544)
>       at org.apache.flume.sink.kafka.KafkaSink.process(KafkaSink.java:224)
>       at 
> org.apache.flume.sink.DefaultSinkProcessor.process(DefaultSinkProcessor.java:67)
>       at org.apache.flume.SinkRunner$PollingRunner.run(SinkRunner.java:145)
>       at java.lang.Thread.run(Thread.java:745)
> client code 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

Reply via email to