Marek Svitok created KAFKA-4176:
-----------------------------------

             Summary: Node stopped receiving heartbeat responses once another node started within the same group
                 Key: KAFKA-4176
                 URL: https://issues.apache.org/jira/browse/KAFKA-4176
             Project: Kafka
          Issue Type: Bug
          Components: consumer
    Affects Versions: 0.10.0.1
         Environment: Centos 7: 3.10.0-229.el7.x86_64 #1 SMP Fri Mar 6 11:36:42 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux
Java: java version "1.8.0_101"
Java(TM) SE Runtime Environment (build 1.8.0_101-b13)
Java HotSpot(TM) 64-Bit Server VM (build 25.101-b13, mixed mode)
            Reporter: Marek Svitok


I have 3 nodes working in the same group. I started them one after the other. As I can see from the log, a node, once started, receives heartbeat responses for the group it is part of. However, once I start another node, the former one stops receiving these responses while the new one keeps receiving them:

Node0
03:14:36.224 [StreamThread-1] DEBUG o.a.k.c.c.i.AbstractCoordinator - Received successful heartbeat response for group test_streams_id
03:14:39.223 [StreamThread-2] DEBUG o.a.k.c.c.i.AbstractCoordinator - Received successful heartbeat response for group test_streams_id
03:14:39.224 [StreamThread-1] DEBUG o.a.k.c.c.i.AbstractCoordinator - Received successful heartbeat response for group test_streams_id
03:14:39.429 [main-SendThread(mujsignal-03:2182)] DEBUG org.apache.zookeeper.ClientCnxn - Got ping response for sessionid: 0x256bc1ce8c30170 after 0ms
03:14:39.462 [main-SendThread(mujsignal-03:2182)] DEBUG org.apache.zookeeper.ClientCnxn - Got ping response for sessionid: 0x256bc1ce8c30171 after 0ms
03:14:42.224 [StreamThread-2] DEBUG o.a.k.c.c.i.AbstractCoordinator - Received successful heartbeat response for group test_streams_id
03:14:42.224 [StreamThread-1] DEBUG o.a.k.c.c.i.AbstractCoordinator - Received successful heartbeat response for group test_streams_id
03:14:45.224 [StreamThread-2] DEBUG o.a.k.c.c.i.AbstractCoordinator - Received successful heartbeat response for group test_streams_id
03:14:45.224 [StreamThread-1] DEBUG o.a.k.c.c.i.AbstractCoordinator - Received successful heartbeat response for group test_streams_id
03:14:48.224 [StreamThread-2] DEBUG o.a.k.c.c.i.AbstractCoordinator - Attempt to heart beat failed for group test_streams_id since it is rebalancing.
03:14:48.224 [StreamThread-2] INFO o.a.k.c.c.i.ConsumerCoordinator - Revoking previously assigned partitions [StreamTopic-2] for group test_streams_id
03:14:48.224 [StreamThread-2] INFO o.a.k.s.p.internals.StreamThread - Removing a task 0_2

Node1
03:22:18.710 [StreamThread-2] DEBUG o.a.k.c.c.i.AbstractCoordinator - Received successful heartbeat response for group test_streams_id
03:22:18.716 [StreamThread-1] DEBUG o.a.k.c.c.i.AbstractCoordinator - Received successful heartbeat response for group test_streams_id
03:22:21.709 [StreamThread-2] DEBUG o.a.k.c.c.i.AbstractCoordinator - Received successful heartbeat response for group test_streams_id
03:22:21.716 [StreamThread-1] DEBUG o.a.k.c.c.i.AbstractCoordinator - Received successful heartbeat response for group test_streams_id
03:22:24.710 [StreamThread-2] DEBUG o.a.k.c.c.i.AbstractCoordinator - Received successful heartbeat response for group test_streams_id
03:22:24.717 [StreamThread-1] DEBUG o.a.k.c.c.i.AbstractCoordinator - Received successful heartbeat response for group test_streams_id
03:22:24.872 [main-SendThread(mujsignal-03:2182)] DEBUG org.apache.zookeeper.ClientCnxn - Got ping response for sessionid: 0x256bc1ce8c30172 after 0ms
03:22:24.992 [main-SendThread(mujsignal-03:2182)] DEBUG org.apache.zookeeper.ClientCnxn - Got ping response for sessionid: 0x256bc1ce8c30173 after 0ms
03:22:27.710 [StreamThread-2] DEBUG o.a.k.c.c.i.AbstractCoordinator - Received successful heartbeat response for group test_streams_id
03:22:27.717 [StreamThread-1] DEBUG o.a.k.c.c.i.AbstractCoordinator - Received successful heartbeat response for group test_streams_id
03:22:30.710 [StreamThread-2] DEBUG o.a.k.c.c.i.AbstractCoordinator - Received successful heartbeat response for group test_streams_id
03:22:30.716 [StreamThread-1] DEBUG o.a.k.c.c.i.AbstractCoordinator - Received successful heartbeat response for group test_streams_id

Configuration used:

03:14:24.520 [main] INFO o.a.k.c.producer.ProducerConfig - ProducerConfig values:
    metric.reporters = []
    metadata.max.age.ms = 300000
    reconnect.backoff.ms = 50
    sasl.kerberos.ticket.renew.window.factor = 0.8
    bootstrap.servers = [mujsignal-03:9092, mujsignal-09:9093]
    ssl.keystore.type = JKS
    sasl.mechanism = GSSAPI
    max.block.ms = 60000
    interceptor.classes = null
    ssl.truststore.password = null
    client.id = Test-Streams-Processor-StreamThread-2-producer
    ssl.endpoint.identification.algorithm = null
    request.timeout.ms = 30000
    acks = 1
    receive.buffer.bytes = 32768
    ssl.truststore.type = JKS
    retries = 0
    ssl.truststore.location = null
    ssl.keystore.password = null
    send.buffer.bytes = 131072
    compression.type = none
    metadata.fetch.timeout.ms = 60000
    retry.backoff.ms = 100
    sasl.kerberos.kinit.cmd = /usr/bin/kinit
    buffer.memory = 33554432
    timeout.ms = 30000
    key.serializer = class org.apache.kafka.common.serialization.ByteArraySerializer
    sasl.kerberos.service.name = null
    sasl.kerberos.ticket.renew.jitter = 0.05
    ssl.trustmanager.algorithm = PKIX
    block.on.buffer.full = false
    ssl.key.password = null
    sasl.kerberos.min.time.before.relogin = 60000
    connections.max.idle.ms = 540000
    max.in.flight.requests.per.connection = 5
    metrics.num.samples = 2
    ssl.protocol = TLS
    ssl.provider = null
    ssl.enabled.protocols = [TLSv1.2, TLSv1.1, TLSv1]
    batch.size = 16384
    ssl.keystore.location = null
    ssl.cipher.suites = null
    security.protocol = PLAINTEXT
    max.request.size = 1048576
    value.serializer = class org.apache.kafka.common.serialization.ByteArraySerializer
    ssl.keymanager.algorithm = SunX509
    metrics.sample.window.ms = 30000
    partitioner.class = class org.apache.kafka.clients.producer.internals.DefaultPartitioner
    linger.ms = 100

03:14:24.547 [main] INFO o.a.k.c.consumer.ConsumerConfig - ConsumerConfig values:
    metric.reporters = []
    metadata.max.age.ms = 300000
    partition.assignment.strategy = [org.apache.kafka.streams.processor.internals.StreamPartitionAssignor]
    reconnect.backoff.ms = 50
    sasl.kerberos.ticket.renew.window.factor = 0.8
    max.partition.fetch.bytes = 1048576
    bootstrap.servers = [mujsignal-03:9092, mujsignal-09:9093]
    ssl.keystore.type = JKS
    enable.auto.commit = false
    sasl.mechanism = GSSAPI
    interceptor.classes = null
    exclude.internal.topics = true
    ssl.truststore.password = null
    client.id = Test-Streams-Processor-StreamThread-2-consumer
    ssl.endpoint.identification.algorithm = null
    max.poll.records = 2147483647
    check.crcs = true
    request.timeout.ms = 40000
    heartbeat.interval.ms = 3000
    auto.commit.interval.ms = 5000
    receive.buffer.bytes = 65536
    ssl.truststore.type = JKS
    ssl.truststore.location = null
    ssl.keystore.password = null
    fetch.min.bytes = 1
    send.buffer.bytes = 131072
    value.deserializer = class org.apache.kafka.common.serialization.ByteArrayDeserializer
    group.id = test_streams_id
    retry.backoff.ms = 100
    sasl.kerberos.kinit.cmd = /usr/bin/kinit
    sasl.kerberos.service.name = null
    sasl.kerberos.ticket.renew.jitter = 0.05
    ssl.trustmanager.algorithm = PKIX
    ssl.key.password = null
    fetch.max.wait.ms = 500
    sasl.kerberos.min.time.before.relogin = 60000
    connections.max.idle.ms = 540000
    session.timeout.ms = 30000
    metrics.num.samples = 2
    key.deserializer = class org.apache.kafka.common.serialization.ByteArrayDeserializer
    ssl.protocol = TLS
    ssl.provider = null
    ssl.enabled.protocols = [TLSv1.2, TLSv1.1, TLSv1]
    ssl.keystore.location = null
    ssl.cipher.suites = null
    security.protocol = PLAINTEXT
    ssl.keymanager.algorithm = SunX509
    metrics.sample.window.ms = 30000
    auto.offset.reset = earliest



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
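
For anyone trying to reproduce this, it may help to find the last successful heartbeat per thread in each node's log instead of reading the DEBUG output by eye. The following is only a minimal sketch: the class name `HeartbeatScan` and the method `lastHeartbeats` are hypothetical helpers, not part of Kafka, and the regular expression assumes the log layout shown in the excerpts above (timestamp, thread name in brackets, level, logger, message).

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper (not a Kafka API): scans log lines in the layout shown
// in this report and records each thread's last successful heartbeat.
final class HeartbeatScan {

    // Assumed layout: "HH:mm:ss.SSS [thread] DEBUG logger - Received successful heartbeat response for group <id>"
    private static final Pattern OK = Pattern.compile(
        "^(\\d{2}:\\d{2}:\\d{2}\\.\\d{3}) \\[([^\\]]+)\\] DEBUG \\S+ - "
        + "Received successful heartbeat response for group (\\S+)");

    /** Returns thread name -> timestamp of its last successful heartbeat, in first-seen order. */
    static Map<String, String> lastHeartbeats(List<String> lines) {
        Map<String, String> last = new LinkedHashMap<>();
        for (String line : lines) {
            Matcher m = OK.matcher(line.trim());
            if (m.find()) {
                // Later lines overwrite earlier ones, so the final value is the latest heartbeat.
                last.put(m.group(2), m.group(1));
            }
        }
        return last;
    }

    public static void main(String[] args) {
        // Sample lines copied from the Node0 excerpt in this report.
        List<String> log = Arrays.asList(
            "03:14:42.224 [StreamThread-2] DEBUG o.a.k.c.c.i.AbstractCoordinator - Received successful heartbeat response for group test_streams_id",
            "03:14:42.224 [StreamThread-1] DEBUG o.a.k.c.c.i.AbstractCoordinator - Received successful heartbeat response for group test_streams_id",
            "03:14:45.224 [StreamThread-2] DEBUG o.a.k.c.c.i.AbstractCoordinator - Received successful heartbeat response for group test_streams_id",
            "03:14:45.224 [StreamThread-1] DEBUG o.a.k.c.c.i.AbstractCoordinator - Received successful heartbeat response for group test_streams_id",
            "03:14:48.224 [StreamThread-2] DEBUG o.a.k.c.c.i.AbstractCoordinator - Attempt to heart beat failed for group test_streams_id since it is rebalancing.");

        lastHeartbeats(log).forEach((thread, ts) ->
            System.out.println(thread + " last successful heartbeat at " + ts));
    }
}
```

Running the same scan over each node's full log and comparing the timestamps against the time the next node was started would show whether the heartbeats really stop at that point or merely pause for the rebalance.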