Hello,

We have Kafka 0.10.0.1 running on a 3-broker cluster. We have an application that consumes from a topic with 10 partitions. The process spawns 10 consumers, all belonging to one consumer group.
We very frequently see messages like these in the consumer logs:

[2018-08-21 11:12:46] :: WARN :: ConsumerCoordinator:554 - Auto offset commit failed for group otp-email-consumer: Commit cannot be completed since the group has already rebalanced and assigned the partitions to another member. This means that the time between subsequent calls to poll() was longer than the configured max.poll.interval.ms, which typically implies that the poll loop is spending too much time message processing. You can address this either by increasing the session timeout or by reducing the maximum size of batches returned in poll() with max.poll.records.
[2018-08-21 11:12:46] :: INFO :: ConsumerCoordinator:333 - Revoking previously assigned partitions [otp-email-1, otp-email-0, otp-email-3, otp-email-2] for group otp-email-consumer
[2018-08-21 11:12:46] :: INFO :: AbstractCoordinator:381 - (Re-)joining group otp-email-consumer
[2018-08-21 11:12:46] :: INFO :: AbstractCoordinator:600 - *Marking the coordinator x.x.x.x:9092 (id: 2147483646 rack: null) dead for group otp-email-consumer*
[2018-08-21 11:12:46] :: INFO :: AbstractCoordinator:600 - *Marking the coordinator x.x.x.x:9092 (id: 2147483646 rack: null) dead for group otp-email-consumer*
[2018-08-21 11:12:46] :: INFO :: AbstractCoordinator$GroupCoordinatorResponseHandler:555 - Discovered coordinator 10.189.179.117:9092 (id: 2147483646 rack: null) for group otp-email-consumer.
[2018-08-21 11:12:46] :: INFO :: AbstractCoordinator:381 - (Re-)joining group otp-email-consumer

After this, the group enters the rebalancing phase, and it takes about 5-10 minutes before it starts consuming messages again.

What does the "Marking the coordinator ... dead" message (marked with asterisks above) mean? According to our monitoring tools, the broker itself does not go down, so why is it declared dead?

Please help. I have been stuck on this issue for 2 months now.
Here's our consumer configuration:

auto.commit.interval.ms = 3000
auto.offset.reset = latest
bootstrap.servers = [x.x.x.x:9092, x.x.x.x:9092, x.x.x.x:9092]
check.crcs = true
client.id =
connections.max.idle.ms = 540000
enable.auto.commit = true
exclude.internal.topics = true
fetch.max.bytes = 52428800
fetch.max.wait.ms = 500
fetch.min.bytes = 1
group.id = otp-notifications-consumer
heartbeat.interval.ms = 3000
interceptor.classes = null
key.deserializer = class org.apache.kafka.common.serialization.StringDeserializer
max.partition.fetch.bytes = 1048576
max.poll.interval.ms = 300000
max.poll.records = 50
metadata.max.age.ms = 300000
metric.reporters = []
metrics.num.samples = 2
metrics.sample.window.ms = 30000
partition.assignment.strategy = [class org.apache.kafka.clients.consumer.RangeAssignor]
receive.buffer.bytes = 65536
reconnect.backoff.ms = 50
request.timeout.ms = 305000
retry.backoff.ms = 100
sasl.kerberos.kinit.cmd = /usr/bin/kinit
sasl.kerberos.min.time.before.relogin = 60000
sasl.kerberos.service.name = null
sasl.kerberos.ticket.renew.jitter = 0.05
sasl.kerberos.ticket.renew.window.factor = 0.8
sasl.mechanism = GSSAPI
security.protocol = SSL
send.buffer.bytes = 131072
session.timeout.ms = 300000
ssl.cipher.suites = null
ssl.enabled.protocols = [TLSv1.2, TLSv1.1, TLSv1]
ssl.endpoint.identification.algorithm = null
ssl.key.password = null
ssl.keymanager.algorithm = SunX509
ssl.keystore.location = null
ssl.keystore.password = null
ssl.keystore.type = JKS
ssl.protocol = TLS
ssl.provider = null
ssl.secure.random.implementation = null
ssl.trustmanager.algorithm = PKIX
ssl.truststore.location = /x/x/client.truststore.jks
ssl.truststore.password = [hidden]
ssl.truststore.type = JKS
value.deserializer = class org.apache.kafka.common.serialization.StringDeserializer
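
For reference, the warning in the log ties the failed commit to the time between poll() calls exceeding max.poll.interval.ms. A rough sketch of the arithmetic this implies for the two values posted above (max.poll.interval.ms = 300000, max.poll.records = 50) might look like the following. The class and method names (PollBudget, perRecordBudgetMs, batchOverruns) and the 7000 ms figure are made up for illustration, and the sketch assumes every poll() returns a full batch that is processed sequentially; it is not a diagnosis of the actual cause.

```java
import java.util.Properties;

public class PollBudget {
    // The two values below are copied from the posted consumer configuration.
    static Properties postedConfig() {
        Properties props = new Properties();
        props.setProperty("max.poll.interval.ms", "300000");
        props.setProperty("max.poll.records", "50");
        return props;
    }

    // Average time available per record if poll() returns a full batch
    // and the next poll() must happen within max.poll.interval.ms.
    static long perRecordBudgetMs(Properties props) {
        long intervalMs = Long.parseLong(props.getProperty("max.poll.interval.ms"));
        long maxRecords = Long.parseLong(props.getProperty("max.poll.records"));
        return intervalMs / maxRecords;
    }

    // Hypothetical check: would a full batch exceed the poll interval at the
    // given average processing time per record?
    static boolean batchOverruns(Properties props, long avgProcessingMsPerRecord) {
        long intervalMs = Long.parseLong(props.getProperty("max.poll.interval.ms"));
        long maxRecords = Long.parseLong(props.getProperty("max.poll.records"));
        return avgProcessingMsPerRecord * maxRecords > intervalMs;
    }

    public static void main(String[] args) {
        Properties props = postedConfig();
        System.out.println("per-record budget (ms): " + perRecordBudgetMs(props));
        // 7000 ms per record is an assumed figure, not a measurement from the application.
        System.out.println("overruns at 7000 ms/record: " + batchOverruns(props, 7000));
    }
}
```

With the posted values the budget works out to 6000 ms per record on average, so timing the processing of each batch in the application would show whether the poll loop ever comes close to that limit.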