Hello,

We are running Kafka 0.10.0.1 on a 3-broker cluster. One of our
applications consumes from a topic with 10 partitions; the process spawns
10 consumers, all belonging to the same consumer group.

What we have observed is that, very frequently, messages like the
following appear in the consumer logs:

[2018-08-21 11:12:46] :: WARN  :: ConsumerCoordinator:554 - Auto offset
commit failed for group otp-email-consumer: Commit cannot be completed
since the group has already rebalanced and assigned the partitions to
another member. This means that the time between subsequent calls to poll()
was longer than the configured max.poll.interval.ms, which typically
implies that the poll loop is spending too much time message processing.
You can address this either by increasing the session timeout or by
reducing the maximum size of batches returned in poll() with
max.poll.records.
[2018-08-21 11:12:46] :: INFO  :: ConsumerCoordinator:333 - Revoking
previously assigned partitions [otp-email-1, otp-email-0, otp-email-3,
otp-email-2] for group otp-email-consumer
[2018-08-21 11:12:46] :: INFO  :: AbstractCoordinator:381 - (Re-)joining
group otp-email-consumer
[2018-08-21 11:12:46] :: INFO  :: AbstractCoordinator:600 - *Marking the
coordinator x.x.x.x:9092 (id: 2147483646 rack: null) dead for group
otp-email-consumer*
[2018-08-21 11:12:46] :: INFO  :: AbstractCoordinator:600 - *Marking the
coordinator x.x.x.x:9092 (id: 2147483646 rack: null) dead for group
otp-email-consumer*
[2018-08-21 11:12:46] :: INFO  ::
AbstractCoordinator$GroupCoordinatorResponseHandler:555 - Discovered
coordinator 10.189.179.117:9092 (id: 2147483646 rack: null) for group
otp-email-consumer.
[2018-08-21 11:12:46] :: INFO  :: AbstractCoordinator:381 - (Re-)joining
group otp-email-consumer

After this, the group enters a rebalancing phase, and it takes about 5-10
minutes before it starts consuming messages again.
What does the "marking the coordinator dead" message mean? According to
our monitoring tools, the broker itself never goes down, so why is it
declared dead? Please help; I have been stuck on this issue for 2 months
now.
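For what it's worth, here is a rough back-of-envelope check of the settings in the config below, assuming (unconfirmed on our side) that per-record processing time is what pushes us past the poll() deadline:

```java
public class PollBudget {
    // Average per-record time budget implied by the two poll settings.
    static long perRecordBudgetMs(long maxPollIntervalMs, int maxPollRecords) {
        return maxPollIntervalMs / maxPollRecords;
    }

    public static void main(String[] args) {
        // With max.poll.interval.ms = 300000 and max.poll.records = 50,
        // each record must finish in ~6 s on average, or the consumer
        // misses its poll() deadline and the group rebalances.
        System.out.println(perRecordBudgetMs(300_000, 50) + " ms per record");
    }
}
```

If our handler ever exceeds that budget (e.g. a slow SMTP call when sending OTP emails), the warning above would be expected.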

Here is our consumer configuration:
auto.commit.interval.ms = 3000
auto.offset.reset = latest
bootstrap.servers = [x.x.x.x:9092, x.x.x.x:9092, x.x.x.x:9092]
check.crcs = true
client.id =
connections.max.idle.ms = 540000
enable.auto.commit = true
exclude.internal.topics = true
fetch.max.bytes = 52428800
fetch.max.wait.ms = 500
fetch.min.bytes = 1
group.id = otp-notifications-consumer
heartbeat.interval.ms = 3000
interceptor.classes = null
key.deserializer = class org.apache.kafka.common.serialization.StringDeserializer
max.partition.fetch.bytes = 1048576
max.poll.interval.ms = 300000
max.poll.records = 50
metadata.max.age.ms = 300000
metric.reporters = []
metrics.num.samples = 2
metrics.sample.window.ms = 30000
partition.assignment.strategy = [class org.apache.kafka.clients.consumer.RangeAssignor]
receive.buffer.bytes = 65536
reconnect.backoff.ms = 50
request.timeout.ms = 305000
retry.backoff.ms = 100
sasl.kerberos.kinit.cmd = /usr/bin/kinit
sasl.kerberos.min.time.before.relogin = 60000
sasl.kerberos.service.name = null
sasl.kerberos.ticket.renew.jitter = 0.05
sasl.kerberos.ticket.renew.window.factor = 0.8
sasl.mechanism = GSSAPI
security.protocol = SSL
send.buffer.bytes = 131072
session.timeout.ms = 300000
ssl.cipher.suites = null
ssl.enabled.protocols = [TLSv1.2, TLSv1.1, TLSv1]
ssl.endpoint.identification.algorithm = null
ssl.key.password = null
ssl.keymanager.algorithm = SunX509
ssl.keystore.location = null
ssl.keystore.password = null
ssl.keystore.type = JKS
ssl.protocol = TLS
ssl.provider = null
ssl.secure.random.implementation = null
ssl.trustmanager.algorithm = PKIX
ssl.truststore.location = /x/x/client.truststore.jks
ssl.truststore.password = [hidden]
ssl.truststore.type = JKS
value.deserializer = class org.apache.kafka.common.serialization.StringDeserializer
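In case it clarifies what we would be changing: the two remedies named in the WARN message correspond to these client properties. The values here are illustrative placeholders only, not what we currently run:

```java
import java.util.Properties;

public class ConsumerTuning {
    // Illustrative values only -- not a recommendation, just the knobs
    // the warning message refers to.
    static Properties tunedProps() {
        Properties props = new Properties();
        props.setProperty("max.poll.records", "10");         // smaller batch per poll()
        props.setProperty("max.poll.interval.ms", "600000"); // more time between poll() calls
        return props;
    }

    public static void main(String[] args) {
        System.out.println(tunedProps());
    }
}
```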
