[ https://issues.apache.org/jira/browse/KAFKA-4676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15833869#comment-15833869 ]
Vishal Shukla commented on KAFKA-4676:
--------------------------------------

Hi Jason,

Thank you very much for the immediate action on this.

Consumer logs from consumer-node-01 and consumer-node-02 at the time the topic gets stuck are attached as [^stuck-consumer-node-1.log] and [^stuck-consumer-node-2.log] respectively. This was around 2017-01-21 03:45 CET. Config as of this time:

{code}
session.timeout.ms=15000
max.poll.interval.ms=300000
max.poll.records=500
request.timeout.ms=3050000
{code}

Restarting the consumer-node-02 service then triggered a rebalance as expected, and the messages were consumed fine. I have also attached the logs from both nodes during that restart as [^restart-node2-consumer-node-2.log] and [^restart-node2-consumer-node-1.log].

Things stayed normal for a few hours, until around 2017-01-21 13:21 CET. This time the case seemed a little different from the previous one. There were no Kafka logs on consumer-node-2. However, consumer-node-1 constantly logged rejoining, partition assignment, and a warning about config, as shown in [^stuck-case2.log].

After this case, we changed {{session.timeout.ms}} to {{300000}} and {{max.poll.records}} to {{100}}. This gets rid of the warning, but we still occasionally observe the rejoining and assignment logs in the consumers.
{code}
2017-01-22 03:30:05,919 [INFO] org.apache.kafka.clients.consumer.internals.ConsumerCoordinator - Revoking previously assigned partitions [] for group event-saved-group
2017-01-22 03:30:05,919 [INFO] org.apache.kafka.clients.consumer.internals.AbstractCoordinator - (Re-)joining group event-saved-group
2017-01-22 03:30:06,692 [INFO] org.apache.kafka.clients.consumer.internals.ConsumerCoordinator - Revoking previously assigned partitions [event-saved-prod-2-8] for group event-saved-group
2017-01-22 03:30:06,692 [INFO] org.apache.kafka.clients.consumer.internals.AbstractCoordinator - (Re-)joining group event-saved-group
2017-01-22 03:30:06,720 [INFO] org.apache.kafka.clients.consumer.internals.ConsumerCoordinator - Revoking previously assigned partitions [event-saved-prod-2-2] for group event-saved-group
2017-01-22 03:30:06,720 [INFO] org.apache.kafka.clients.consumer.internals.AbstractCoordinator - (Re-)joining group event-saved-group
2017-01-22 03:30:06,720 [INFO] org.apache.kafka.clients.consumer.internals.ConsumerCoordinator - Revoking previously assigned partitions [event-saved-prod-2-7] for group event-saved-group
2017-01-22 03:30:06,720 [INFO] org.apache.kafka.clients.consumer.internals.AbstractCoordinator - (Re-)joining group event-saved-group
2017-01-22 03:30:07,956 [INFO] org.apache.kafka.clients.consumer.internals.AbstractCoordinator - Successfully joined group event-saved-group with generation 81
2017-01-22 03:30:07,956 [INFO] org.apache.kafka.clients.consumer.internals.ConsumerCoordinator - Setting newly assigned partitions [event-saved-prod-2-5] for group event-saved-group
2017-01-22 03:30:07,957 [INFO] org.apache.kafka.clients.consumer.internals.AbstractCoordinator - Successfully joined group event-saved-group with generation 81
2017-01-22 03:30:07,957 [INFO] org.apache.kafka.clients.consumer.internals.ConsumerCoordinator - Setting newly assigned partitions [event-saved-prod-2-2] for group event-saved-group
2017-01-22 03:30:07,960 [INFO] org.apache.kafka.clients.consumer.internals.AbstractCoordinator - Successfully joined group event-saved-group with generation 81
2017-01-22 03:30:07,960 [INFO] org.apache.kafka.clients.consumer.internals.AbstractCoordinator - Successfully joined group event-saved-group with generation 81
2017-01-22 03:30:07,960 [INFO] org.apache.kafka.clients.consumer.internals.ConsumerCoordinator - Setting newly assigned partitions [] for group event-saved-group
2017-01-22 03:30:07,960 [INFO] org.apache.kafka.clients.consumer.internals.ConsumerCoordinator - Setting newly assigned partitions [event-saved-prod-2-6] for group event-saved-group
2017-01-22 03:30:10,958 [INFO] org.apache.kafka.clients.consumer.internals.ConsumerCoordinator - Revoking previously assigned partitions [event-saved-prod-2-5] for group event-saved-group
2017-01-22 03:30:10,958 [INFO] org.apache.kafka.clients.consumer.internals.AbstractCoordinator - (Re-)joining group event-saved-group
2017-01-22 03:30:10,971 [INFO] org.apache.kafka.clients.consumer.internals.ConsumerCoordinator - Revoking previously assigned partitions [] for group event-saved-group
2017-01-22 03:30:10,971 [INFO] org.apache.kafka.clients.consumer.internals.AbstractCoordinator - (Re-)joining group event-saved-group
{code}

Do you see these logs as unusual?

Also note that we are running the app against a single-machine ZK and Kafka (both on the same VM). We are aware that this isn't recommended and that we should run ZK separately in a quorum. However, I assume that it may not have anything to do with the current issue. Please let me know if that is incorrect or if you need more info about it.

Regarding which partitions get stuck: sometimes it is only the partitions of one consumer, and many times all partitions get stuck. I need to wait for the issue to recur in order to give you the exact partitions along with the respective logs. I will post them once I observe the issue again.

I hope this info helps.
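
In case it helps with comparing the rejoin and assignment logs across nodes, here is a minimal sketch of a tally over consumer log text. It is not part of Kafka: the class name {{RebalanceLogTally}}, the method {{tally}} and the result keys are hypothetical, and the patterns are simply taken from the message texts in the excerpt earlier in this comment, so they may not match other client versions or log layouts.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper (not a Kafka API): counts rebalance-related messages in consumer logs. */
class RebalanceLogTally {

    // Message fragments copied from the log excerpt; assumed stable only for this client version.
    private static final Pattern REVOKED =
            Pattern.compile(Pattern.quote("Revoking previously assigned partitions"));
    private static final Pattern REJOINING =
            Pattern.compile(Pattern.quote("(Re-)joining group"));
    private static final Pattern JOINED =
            Pattern.compile("Successfully joined group \\S+ with generation (\\d+)");
    private static final Pattern ASSIGNED =
            Pattern.compile(Pattern.quote("Setting newly assigned partitions"));
    private static final Pattern EMPTY_ASSIGNED =
            Pattern.compile(Pattern.quote("Setting newly assigned partitions []"));

    /** Counts every occurrence of the pattern, so flattened multi-entry lines are handled too. */
    private static int count(Pattern pattern, String text) {
        Matcher m = pattern.matcher(text);
        int n = 0;
        while (m.find()) {
            n++;
        }
        return n;
    }

    /** Returns event counts plus the highest group generation seen (-1 if none). */
    static Map<String, Integer> tally(String logText) {
        Map<String, Integer> result = new LinkedHashMap<>();
        result.put("revocations", count(REVOKED, logText));
        result.put("rejoins", count(REJOINING, logText));
        result.put("joins", count(JOINED, logText));
        result.put("assignments", count(ASSIGNED, logText));
        result.put("emptyAssignments", count(EMPTY_ASSIGNED, logText));

        int latestGeneration = -1;
        Matcher m = JOINED.matcher(logText);
        while (m.find()) {
            latestGeneration = Math.max(latestGeneration, Integer.parseInt(m.group(1)));
        }
        result.put("latestGeneration", latestGeneration);
        return result;
    }

    public static void main(String[] args) throws IOException {
        // Usage: java RebalanceLogTally <path-to-consumer-log>
        String text = new String(Files.readAllBytes(Paths.get(args[0])), StandardCharsets.UTF_8);
        tally(text).forEach((key, value) -> System.out.println(key + "=" + value));
    }
}
```

Running it against the log of each node for the same time window gives a rough count of how often each consumer was revoked, rejoined, and was assigned an empty partition list. It says nothing about why the rebalances happen.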
Please let us know if any other information can help.

> Kafka consumers gets stuck for some partitions
> ----------------------------------------------
>
>                 Key: KAFKA-4676
>                 URL: https://issues.apache.org/jira/browse/KAFKA-4676
>             Project: Kafka
>          Issue Type: Bug
>   Affects Versions: 0.10.1.0
>            Reporter: Vishal Shukla
>            Priority: Critical
>              Labels: consumer, reliability
>         Attachments: restart-node2-consumer-node-1.log,
> restart-node2-consumer-node-2.log, stuck-case2.log,
> stuck-consumer-node-1.log, stuck-consumer-node-2.log,
> stuck-topic-thread-dump.log
>
>
> We recently upgraded to Kafka 0.10.1.0. We are frequently facing issue that
> Kafka consumers get stuck suddenly for some partitions.
> Attached thread dump.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)