Hi, I'm having trouble with recurring stalled connections to Kafka. The symptom is that some consumers lag behind, and most of the time restarting the consumer doesn't help. (Occasionally, when another consumer takes over the problematic partition, it no longer fails, but usually it stalls again shortly after the partition switches consumers.)

Taking a thread dump in this situation, I see that the call stalls in the hasNext() method of the ConsumerIterator, even though there are many messages to consume and that particular partition of the topic is lagging.

"hermes-consumer-thread-1" #75 prio=5 os_prio=0 tid=0x00007fe430fde000 nid=0x7c01 waiting on condition [0x00007fe428ce1000]
   java.lang.Thread.State: TIMED_WAITING (parking)
        at sun.misc.Unsafe.park(Native Method)
        - parking to wait for <0x000000070932c870> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
        at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2078)
        at java.util.concurrent.LinkedBlockingQueue.poll(LinkedBlockingQueue.java:467)
        at kafka.consumer.ConsumerIterator.makeNext(ConsumerIterator.scala:65)
        at kafka.consumer.ConsumerIterator.makeNext(ConsumerIterator.scala:33)
        at kafka.utils.IteratorTemplate.maybeComputeNext(IteratorTemplate.scala:66)
        at kafka.utils.IteratorTemplate.hasNext(IteratorTemplate.scala:58)

Reading through the mailing list, I've come across older solutions for this problem, including setting consumer.timeout.ms (which I've added, with no effect) and checking the size of the messages (if a message is bigger than fetch.message.max.bytes, the consumer will stop like this), but my messages are all under 300 bytes in size.

Has anyone had this problem? Any help would be appreciated.

Thanks
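
P.S. To make the dump easier to read: the thread is parked in LinkedBlockingQueue.poll with a timeout, which, as far as I understand, is simply what a timed wait on an empty queue looks like. Below is a minimal JDK-only sketch of that behavior. It is not Kafka's code; the class name, the nextOrTimeout helper and the "chunk-0" value are all hypothetical and only stand in for the iterator's internal queue.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

// Hypothetical illustration only; this is not Kafka's ConsumerIterator.
public class StalledIteratorSketch {

    // Waits up to timeoutMs for an item; returns null if the queue stays empty.
    // While waiting, a thread dump shows the caller as TIMED_WAITING (parking).
    static String nextOrTimeout(BlockingQueue<String> queue, long timeoutMs) {
        try {
            return queue.poll(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            // Restore the interrupt flag and report "nothing received".
            Thread.currentThread().interrupt();
            return null;
        }
    }

    public static void main(String[] args) {
        BlockingQueue<String> queue = new LinkedBlockingQueue<>();

        // Nothing has been enqueued, so this parks for about 200 ms and gives up.
        System.out.println("empty queue -> " + nextOrTimeout(queue, 200));

        // Once a producer enqueues something, the same call returns immediately.
        queue.offer("chunk-0");
        System.out.println("after offer -> " + nextOrTimeout(queue, 200));
    }
}
```

If that reading is right, the consumer thread itself is fine and the real question is why nothing is being put on the queue for that partition.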