I know the average time to process one record; it is about 70-80 ms. I have set session.timeout.ms so high (300000 ms) that the total processing time for one poll() invocation should be well within it: with max.poll.records = 50 and 70-80 ms per record, one batch should take only about 4 seconds.
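
A minimal sketch of how the gap between two poll() invocations and the size of each returned batch could be measured, assuming a plain single-threaded poll loop; the broker address, topic name, and process() method are placeholders for the real mail-sending consumer, and the property list is trimmed to what the example needs:

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

import java.util.Collections;
import java.util.Properties;

public class PollTimingConsumer {

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "x.x.x.x:9092");   // placeholder broker
        props.put("group.id", "otp-notifications-consumer");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("max.poll.records", "50");

        KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
        consumer.subscribe(Collections.singletonList("otp-email"));   // placeholder topic

        long previousPollStart = System.currentTimeMillis();
        while (true) {
            long pollStart = System.currentTimeMillis();
            ConsumerRecords<String, String> records = consumer.poll(100);

            // The gap between successive poll() calls is what max.poll.interval.ms is checked against.
            System.out.printf("time since previous poll(): %d ms, records returned: %d%n",
                    pollStart - previousPollStart, records.count());
            previousPollStart = pollStart;

            for (ConsumerRecord<String, String> record : records) {
                process(record);   // placeholder for the ~70-80 ms mail-sending logic
            }
        }
    }

    private static void process(ConsumerRecord<String, String> record) {
        // business logic goes here
    }
}

If that printed gap ever approaches max.poll.interval.ms, the problem is in the loop itself; if it stays a few seconds, the cause of the rebalances lies elsewhere (e.g. pauses outside the loop).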
On Wed, Aug 22, 2018 at 5:04 PM Steve Tian <steve.cs.t...@gmail.com> wrote:

> Have you measured the duration between two `poll` invocations and the size
> of the returned `ConsumerRecords`?
>
> On Wed, Aug 22, 2018, 7:00 PM Shantanu Deshmukh <shantanu...@gmail.com> wrote:
>
> > Ohh sorry, my bad. Kafka version is 0.10.1.0 indeed, and so is the client.
> >
> > On Wed, Aug 22, 2018 at 4:26 PM Steve Tian <steve.cs.t...@gmail.com> wrote:
> >
> > > NVM. What's your client version? I'm asking because max.poll.interval.ms
> > > was only introduced in 0.10.1.0, which is not the version you mentioned
> > > in the email thread.
> > >
> > > On Wed, Aug 22, 2018, 6:51 PM Shantanu Deshmukh <shantanu...@gmail.com> wrote:
> > >
> > > > How do I check for GC pausing?
> > > >
> > > > On Wed, Aug 22, 2018 at 4:12 PM Steve Tian <steve.cs.t...@gmail.com> wrote:
> > > >
> > > > > Did you observe any GC pausing?
> > > > >
> > > > > On Wed, Aug 22, 2018, 6:38 PM Shantanu Deshmukh <shantanu...@gmail.com> wrote:
> > > > >
> > > > > > Hi Steve,
> > > > > >
> > > > > > The application is just sending mails. Every record is just an email
> > > > > > request with very basic business logic. Generally it doesn't take
> > > > > > more than 200 ms to process a single mail. Currently it is averaging
> > > > > > out at 70-80 ms.
> > > > > >
> > > > > > On Wed, Aug 22, 2018 at 3:06 PM Steve Tian <steve.cs.t...@gmail.com> wrote:
> > > > > >
> > > > > > > How long did it take to process 50 `ConsumerRecord`s?
> > > > > > >
> > > > > > > On Wed, Aug 22, 2018, 5:16 PM Shantanu Deshmukh <shantanu...@gmail.com> wrote:
> > > > > > >
> > > > > > > > Hello,
> > > > > > > >
> > > > > > > > We have Kafka 0.10.0.1 running on a 3 broker cluster. We have an
> > > > > > > > application which consumes from a topic having 10 partitions. 10
> > > > > > > > consumers are spawned from this process; they belong to one
> > > > > > > > consumer group.
> > > > > > > >
> > > > > > > > What we have observed is that very frequently we see messages
> > > > > > > > like these in the consumer logs:
> > > > > > > >
> > > > > > > > [2018-08-21 11:12:46] :: WARN :: ConsumerCoordinator:554 - Auto offset
> > > > > > > > commit failed for group otp-email-consumer: Commit cannot be completed
> > > > > > > > since the group has already rebalanced and assigned the partitions to
> > > > > > > > another member. This means that the time between subsequent calls to
> > > > > > > > poll() was longer than the configured max.poll.interval.ms, which
> > > > > > > > typically implies that the poll loop is spending too much time message
> > > > > > > > processing. You can address this either by increasing the session
> > > > > > > > timeout or by reducing the maximum size of batches returned in poll()
> > > > > > > > with max.poll.records.
> > > > > > > > [2018-08-21 11:12:46] :: INFO :: ConsumerCoordinator:333 - Revoking
> > > > > > > > previously assigned partitions [otp-email-1, otp-email-0, otp-email-3,
> > > > > > > > otp-email-2] for group otp-email-consumer
> > > > > > > > [2018-08-21 11:12:46] :: INFO :: AbstractCoordinator:381 - (Re-)joining
> > > > > > > > group otp-email-consumer
> > > > > > > > [2018-08-21 11:12:46] :: INFO :: AbstractCoordinator:600 - *Marking the
> > > > > > > > coordinator x.x.x.x:9092 (id: 2147483646 rack: null) dead for group
> > > > > > > > otp-email-consumer*
> > > > > > > > [2018-08-21 11:12:46] :: INFO :: AbstractCoordinator:600 - *Marking the
> > > > > > > > coordinator x.x.x.x:9092 (id: 2147483646 rack: null) dead for group
> > > > > > > > otp-email-consumer*
> > > > > > > > [2018-08-21 11:12:46] :: INFO ::
> > > > > > > > AbstractCoordinator$GroupCoordinatorResponseHandler:555 - Discovered
> > > > > > > > coordinator 10.189.179.117:9092 (id: 2147483646 rack: null) for group
> > > > > > > > otp-email-consumer.
> > > > > > > > [2018-08-21 11:12:46] :: INFO :: AbstractCoordinator:381 - (Re-)joining
> > > > > > > > group otp-email-consumer
> > > > > > > >
> > > > > > > > After this, the group enters the rebalancing phase and it takes
> > > > > > > > about 5-10 minutes to start consuming messages again. What does
> > > > > > > > this message mean? The actual broker doesn't go down as per our
> > > > > > > > monitoring tools, so how come it is declared dead? Please help, I
> > > > > > > > have been stuck on this issue for 2 months now.
> > > > > > > >
> > > > > > > > Here's our consumer configuration:
> > > > > > > >
> > > > > > > > auto.commit.interval.ms = 3000
> > > > > > > > auto.offset.reset = latest
> > > > > > > > bootstrap.servers = [x.x.x.x:9092, x.x.x.x:9092, x.x.x.x:9092]
> > > > > > > > check.crcs = true
> > > > > > > > client.id =
> > > > > > > > connections.max.idle.ms = 540000
> > > > > > > > enable.auto.commit = true
> > > > > > > > exclude.internal.topics = true
> > > > > > > > fetch.max.bytes = 52428800
> > > > > > > > fetch.max.wait.ms = 500
> > > > > > > > fetch.min.bytes = 1
> > > > > > > > group.id = otp-notifications-consumer
> > > > > > > > heartbeat.interval.ms = 3000
> > > > > > > > interceptor.classes = null
> > > > > > > > key.deserializer = class org.apache.kafka.common.serialization.StringDeserializer
> > > > > > > > max.partition.fetch.bytes = 1048576
> > > > > > > > max.poll.interval.ms = 300000
> > > > > > > > max.poll.records = 50
> > > > > > > > metadata.max.age.ms = 300000
> > > > > > > > metric.reporters = []
> > > > > > > > metrics.num.samples = 2
> > > > > > > > metrics.sample.window.ms = 30000
> > > > > > > > partition.assignment.strategy = [class org.apache.kafka.clients.consumer.RangeAssignor]
> > > > > > > > receive.buffer.bytes = 65536
> > > > > > > > reconnect.backoff.ms = 50
> > > > > > > > request.timeout.ms = 305000
> > > > > > > > retry.backoff.ms = 100
> > > > > > > > sasl.kerberos.kinit.cmd = /usr/bin/kinit
> > > > > > > > sasl.kerberos.min.time.before.relogin = 60000
> > > > > > > > sasl.kerberos.service.name = null
> > > > > > > > sasl.kerberos.ticket.renew.jitter = 0.05
> > > > > > > > sasl.kerberos.ticket.renew.window.factor = 0.8
> > > > > > > > sasl.mechanism = GSSAPI
> > > > > > > > security.protocol = SSL
> > > > > > > > send.buffer.bytes = 131072
> > > > > > > > session.timeout.ms = 300000
> > > > > > > > ssl.cipher.suites = null
> > > > > > > > ssl.enabled.protocols = [TLSv1.2, TLSv1.1, TLSv1]
> > > > > > > > ssl.endpoint.identification.algorithm = null
> > > > > > > > ssl.key.password = null
> > > > > > > > ssl.keymanager.algorithm = SunX509
> > > > > > > > ssl.keystore.location = null
> > > > > > > > ssl.keystore.password = null
> > > > > > > > ssl.keystore.type = JKS
> > > > > > > > ssl.protocol = TLS
> > > > > > > > ssl.provider = null
> > > > > > > > ssl.secure.random.implementation = null
> > > > > > > > ssl.trustmanager.algorithm = PKIX
> > > > > > > > ssl.truststore.location = /x/x/client.truststore.jks
> > > > > > > > ssl.truststore.password = [hidden]
> > > > > > > > ssl.truststore.type = JKS
> > > > > > > > value.deserializer = class org.apache.kafka.common.serialization.StringDeserializer
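
Regarding the earlier "How do I check for GC pausing?" question: the usual way is to restart the consumer JVM with GC logging enabled (on Java 8, for example, -Xloggc:gc.log -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintGCApplicationStoppedTime) and look for long stop times around the rebalance timestamps. As a rough in-process alternative, a small monitor thread like the sketch below reports GC time through GarbageCollectorMXBean; the class name and the 10-second sampling interval are just placeholders.

import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

// Periodically prints how much time the JVM spent in GC since the last sample.
// Large jumps here (seconds) suggest GC pauses, not slow record processing,
// are what is delaying the poll loop.
public class GcPauseMonitor implements Runnable {
    @Override
    public void run() {
        long lastTotalGcMs = 0;
        while (!Thread.currentThread().isInterrupted()) {
            long totalGcMs = 0;
            for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
                totalGcMs += gc.getCollectionTime();   // cumulative ms spent in this collector
            }
            System.out.printf("GC time in last 10s window: %d ms%n", totalGcMs - lastTotalGcMs);
            lastTotalGcMs = totalGcMs;
            try {
                Thread.sleep(10_000);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
    }
}

It could be started as a daemon thread alongside each consumer, e.g. new Thread(new GcPauseMonitor(), "gc-monitor").start(), and its output correlated with the "Marking the coordinator ... dead" log entries.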