Re: Frequent appearance of "Marking the coordinator dead" message in consumer log

Steve Tian Wed, 22 Aug 2018 03:56:23 -0700

NVM.  What's your client version?  I'm asking as max.poll.interval.ms
should be introduced since 0.10.1.0, which is not the version you mentioned
in the email thread.


On Wed, Aug 22, 2018, 6:51 PM Shantanu Deshmukh <[email protected]>
wrote:

> How do I check for GC pausing?
>
> On Wed, Aug 22, 2018 at 4:12 PM Steve Tian <[email protected]>
> wrote:
>
> > Did you observed any GC-pausing?
> >
> > On Wed, Aug 22, 2018, 6:38 PM Shantanu Deshmukh <[email protected]>
> > wrote:
> >
> > > Hi Steve,
> > >
> > > Application is just sending mails. Every record is just a email request
> > > with very basic business logic. Generally it doesn't take more than
> 200ms
> > > to process a single mail. Currently it is averaging out at 70-80 ms.
> > >
> > > On Wed, Aug 22, 2018 at 3:06 PM Steve Tian <[email protected]>
> > > wrote:
> > >
> > > > How long did it take to process 50 `ConsumerRecord`s?
> > > >
> > > > On Wed, Aug 22, 2018, 5:16 PM Shantanu Deshmukh <
> [email protected]
> > >
> > > > wrote:
> > > >
> > > > > Hello,
> > > > >
> > > > > We have Kafka 0.10.0.1 running on a 3 broker cluster. We have an
> > > > > application which consumes from a topic having 10 partitions. 10
> > > > consumers
> > > > > are spawned from this process, they belong to one consumer group.
> > > > >
> > > > > What we have observed is that very frequently we are observing such
> > > > > messages in consumer logs
> > > > >
> > > > > [2018-08-21 11:12:46] :: WARN  :: ConsumerCoordinator:554 - Auto
> > offset
> > > > > commit failed for group otp-email-consumer: Commit cannot be
> > completed
> > > > > since the group has already rebalanced and assigned the partitions
> to
> > > > > another member. This means that the time between subsequent calls
> to
> > > > poll()
> > > > > was longer than the configured max.poll.interval.ms, which
> typically
> > > > > implies that the poll loop is spending too much time message
> > > processing.
> > > > > You can address this either by increasing the session timeout or by
> > > > > reducing the maximum size of batches returned in poll() with
> > > > > max.poll.records.
> > > > > [2018-08-21 11:12:46] :: INFO  :: ConsumerCoordinator:333 -
> Revoking
> > > > > previously assigned partitions [otp-email-1, otp-email-0,
> > otp-email-3,
> > > > > otp-email-2] for group otp-email-consumer
> > > > > [2018-08-21 11:12:46] :: INFO  :: AbstractCoordinator:381 -
> > > (Re-)joining
> > > > > group otp-email-consumer
> > > > > [2018-08-21 11:12:46] :: INFO  :: AbstractCoordinator:600 -
> *Marking
> > > the
> > > > > coordinator x.x.x.x:9092 (id: 2147483646 rack: null) dead for group
> > > > > otp-email-consumer*
> > > > > [2018-08-21 11:12:46] :: INFO  :: AbstractCoordinator:600 -
> *Marking
> > > the
> > > > > coordinator x.x.x.x:9092 (id: 2147483646 rack: null) dead for group
> > > > > otp-email-consumer*
> > > > > [2018-08-21 11:12:46] :: INFO  ::
> > > > > AbstractCoordinator$GroupCoordinatorResponseHandler:555 -
> Discovered
> > > > > coordinator 10.189.179.117:9092 (id: 2147483646 rack: null) for
> > group
> > > > > otp-email-consumer.
> > > > > [2018-08-21 11:12:46] :: INFO  :: AbstractCoordinator:381 -
> > > (Re-)joining
> > > > > group otp-email-consumer
> > > > >
> > > > > After this, the group enters rebalancing phase and it takes about
> > 5-10
> > > > > minutes to start consuming messages again.
> > > > > What does this message mean? The actual broker doesn't  go down as
> > per
> > > > our
> > > > > monitoring tools. So how come it is declared dead? Please help, I
> am
> > > > stuck
> > > > > on this issue since 2 months now.
> > > > >
> > > > > Here's our consumer configuration
> > > > > auto.commit.interval.ms = 3000
> > > > > auto.offset.reset = latest
> > > > > bootstrap.servers = [x.x.x.x:9092, x.x.x.x:9092, x.x.x.x:9092]
> > > > > check.crcs = true
> > > > > client.id =
> > > > > connections.max.idle.ms = 540000
> > > > > enable.auto.commit = true
> > > > > exclude.internal.topics = true
> > > > > fetch.max.bytes = 52428800
> > > > > fetch.max.wait.ms = 500
> > > > > fetch.min.bytes = 1
> > > > > group.id = otp-notifications-consumer
> > > > > heartbeat.interval.ms = 3000
> > > > > interceptor.classes = null
> > > > > key.deserializer = class org.apache.kafka.common.serialization.
> > > > > StringDeserializer
> > > > > max.partition.fetch.bytes = 1048576
> > > > > max.poll.interval.ms = 300000
> > > > > max.poll.records = 50
> > > > > metadata.max.age.ms = 300000
> > > > > metric.reporters = []
> > > > > metrics.num.samples = 2
> > > > > metrics.sample.window.ms = 30000
> > > > > partition.assignment.strategy = [class org.apache.kafka.clients.
> > > > > consumer.RangeAssignor]
> > > > > receive.buffer.bytes = 65536
> > > > > reconnect.backoff.ms = 50
> > > > > request.timeout.ms = 305000
> > > > > retry.backoff.ms = 100
> > > > > sasl.kerberos.kinit.cmd = /usr/bin/kinit
> > > > > sasl.kerberos.min.time.before.relogin = 60000
> > > > > sasl.kerberos.service.name = null
> > > > > sasl.kerberos.ticket.renew.jitter = 0.05
> > > > > sasl.kerberos.ticket.renew.window.factor = 0.8
> > > > > sasl.mechanism = GSSAPI
> > > > > security.protocol = SSL
> > > > > send.buffer.bytes = 131072
> > > > > session.timeout.ms = 300000
> > > > > ssl.cipher.suites = null
> > > > > ssl.enabled.protocols = [TLSv1.2, TLSv1.1, TLSv1]
> > > > > ssl.endpoint.identification.algorithm = null
> > > > > ssl.key.password = null
> > > > > ssl.keymanager.algorithm = SunX509
> > > > > ssl.keystore.location = null
> > > > > ssl.keystore.password = null
> > > > > ssl.keystore.type = JKS
> > > > > ssl.protocol = TLS
> > > > > ssl.provider = null
> > > > > ssl.secure.random.implementation = null
> > > > > ssl.trustmanager.algorithm = PKIX
> > > > > ssl.truststore.location = /x/x/client.truststore.jks
> > > > > ssl.truststore.password = [hidden]
> > > > > ssl.truststore.type = JKS
> > > > > value.deserializer = class org.apache.kafka.common.serialization.
> > > > > StringDeserializer
> > > > >
> > > >
> > >
> >
>

Re: Frequent appearance of "Marking the coordinator dead" message in consumer log

Reply via email to