I know the average time to process one record; it is about 70-80 ms. I have set session.timeout.ms so high (300000 ms) that the total processing time for one poll() invocation should be well within it: with max.poll.records = 50 and 70-80 ms per record, one batch should take only about 4 seconds.
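
A minimal sketch of how the gap between two poll() invocations and the size of each returned batch could be measured, assuming a plain single-threaded poll loop; the broker address, topic name, and process() method are placeholders for the real mail-sending consumer, and the property list is trimmed to what the example needs:

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

import java.util.Collections;
import java.util.Properties;

public class PollTimingConsumer {

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "x.x.x.x:9092");   // placeholder broker
        props.put("group.id", "otp-notifications-consumer");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("max.poll.records", "50");

        KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
        consumer.subscribe(Collections.singletonList("otp-email"));   // placeholder topic

        long previousPollStart = System.currentTimeMillis();
        while (true) {
            long pollStart = System.currentTimeMillis();
            ConsumerRecords<String, String> records = consumer.poll(100);

            // The gap between successive poll() calls is what max.poll.interval.ms is checked against.
            System.out.printf("time since previous poll(): %d ms, records returned: %d%n",
                    pollStart - previousPollStart, records.count());
            previousPollStart = pollStart;

            for (ConsumerRecord<String, String> record : records) {
                process(record);   // placeholder for the ~70-80 ms mail-sending logic
            }
        }
    }

    private static void process(ConsumerRecord<String, String> record) {
        // business logic goes here
    }
}

If that printed gap ever approaches max.poll.interval.ms, the problem is in the loop itself; if it stays a few seconds, the cause of the rebalances lies elsewhere (e.g. pauses outside the loop).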
On Wed, Aug 22, 2018 at 5:04 PM Steve Tian <steve.cs.t...@gmail.com> wrote:

> Have you measured the duration between two `poll` invocations and the size
> of the returned `ConsumerRecords`?
>
> On Wed, Aug 22, 2018, 7:00 PM Shantanu Deshmukh <shantanu...@gmail.com> wrote:
>
> > Ohh sorry, my bad. Kafka version is 0.10.1.0 indeed, and so is the client.
> >
> > On Wed, Aug 22, 2018 at 4:26 PM Steve Tian <steve.cs.t...@gmail.com> wrote:
> >
> > > NVM. What's your client version? I'm asking because max.poll.interval.ms
> > > was only introduced in 0.10.1.0, which is not the version you mentioned
> > > in the email thread.
> > >
> > > On Wed, Aug 22, 2018, 6:51 PM Shantanu Deshmukh <shantanu...@gmail.com> wrote:
> > >
> > > > How do I check for GC pausing?
> > > >
> > > > On Wed, Aug 22, 2018 at 4:12 PM Steve Tian <steve.cs.t...@gmail.com> wrote:
> > > >
> > > > > Did you observe any GC pausing?
> > > > >
> > > > > On Wed, Aug 22, 2018, 6:38 PM Shantanu Deshmukh <shantanu...@gmail.com> wrote:
> > > > >
> > > > > > Hi Steve,
> > > > > >
> > > > > > The application is just sending mails. Every record is just an email
> > > > > > request with very basic business logic. Generally it doesn't take
> > > > > > more than 200 ms to process a single mail. Currently it is averaging
> > > > > > out at 70-80 ms.
> > > > > >
> > > > > > On Wed, Aug 22, 2018 at 3:06 PM Steve Tian <steve.cs.t...@gmail.com> wrote:
> > > > > >
> > > > > > > How long did it take to process 50 `ConsumerRecord`s?
> > > > > > >
> > > > > > > On Wed, Aug 22, 2018, 5:16 PM Shantanu Deshmukh <shantanu...@gmail.com> wrote:
> > > > > > >
> > > > > > > > Hello,
> > > > > > > >
> > > > > > > > We have Kafka 0.10.0.1 running on a 3 broker cluster. We have an
> > > > > > > > application which consumes from a topic having 10 partitions. 10
> > > > > > > > consumers are spawned from this process; they belong to one
> > > > > > > > consumer group.
> > > > > > > >
> > > > > > > > What we have observed is that very frequently we see messages
> > > > > > > > like these in the consumer logs:
> > > > > > > >
> > > > > > > > [2018-08-21 11:12:46] :: WARN :: ConsumerCoordinator:554 - Auto offset
> > > > > > > > commit failed for group otp-email-consumer: Commit cannot be completed
> > > > > > > > since the group has already rebalanced and assigned the partitions to
> > > > > > > > another member. This means that the time between subsequent calls to
> > > > > > > > poll() was longer than the configured max.poll.interval.ms, which
> > > > > > > > typically implies that the poll loop is spending too much time message
> > > > > > > > processing. You can address this either by increasing the session
> > > > > > > > timeout or by reducing the maximum size of batches returned in poll()
> > > > > > > > with max.poll.records.
> > > > > > > > [2018-08-21 11:12:46] :: INFO :: ConsumerCoordinator:333 - Revoking
> > > > > > > > previously assigned partitions [otp-email-1, otp-email-0, otp-email-3,
> > > > > > > > otp-email-2] for group otp-email-consumer
> > > > > > > > [2018-08-21 11:12:46] :: INFO :: AbstractCoordinator:381 - (Re-)joining
> > > > > > > > group otp-email-consumer
> > > > > > > > [2018-08-21 11:12:46] :: INFO :: AbstractCoordinator:600 - *Marking the
> > > > > > > > coordinator x.x.x.x:9092 (id: 2147483646 rack: null) dead for group
> > > > > > > > otp-email-consumer*
> > > > > > > > [2018-08-21 11:12:46] :: INFO :: AbstractCoordinator:600 - *Marking the
> > > > > > > > coordinator x.x.x.x:9092 (id: 2147483646 rack: null) dead for group
> > > > > > > > otp-email-consumer*
> > > > > > > > [2018-08-21 11:12:46] :: INFO ::
> > > > > > > > AbstractCoordinator$GroupCoordinatorResponseHandler:555 - Discovered
> > > > > > > > coordinator 10.189.179.117:9092 (id: 2147483646 rack: null) for group
> > > > > > > > otp-email-consumer.
> > > > > > > > [2018-08-21 11:12:46] :: INFO :: AbstractCoordinator:381 - (Re-)joining
> > > > > > > > group otp-email-consumer
> > > > > > > >
> > > > > > > > After this, the group enters the rebalancing phase and it takes
> > > > > > > > about 5-10 minutes to start consuming messages again. What does
> > > > > > > > this message mean? The actual broker doesn't go down as per our
> > > > > > > > monitoring tools, so how come it is declared dead? Please help, I
> > > > > > > > have been stuck on this issue for 2 months now.
> > > > > > > >
> > > > > > > > Here's our consumer configuration:
> > > > > > > >
> > > > > > > > auto.commit.interval.ms = 3000
> > > > > > > > auto.offset.reset = latest
> > > > > > > > bootstrap.servers = [x.x.x.x:9092, x.x.x.x:9092, x.x.x.x:9092]
> > > > > > > > check.crcs = true
> > > > > > > > client.id =
> > > > > > > > connections.max.idle.ms = 540000
> > > > > > > > enable.auto.commit = true
> > > > > > > > exclude.internal.topics = true
> > > > > > > > fetch.max.bytes = 52428800
> > > > > > > > fetch.max.wait.ms = 500
> > > > > > > > fetch.min.bytes = 1
> > > > > > > > group.id = otp-notifications-consumer
> > > > > > > > heartbeat.interval.ms = 3000
> > > > > > > > interceptor.classes = null
> > > > > > > > key.deserializer = class org.apache.kafka.common.serialization.StringDeserializer
> > > > > > > > max.partition.fetch.bytes = 1048576
> > > > > > > > max.poll.interval.ms = 300000
> > > > > > > > max.poll.records = 50
> > > > > > > > metadata.max.age.ms = 300000
> > > > > > > > metric.reporters = []
> > > > > > > > metrics.num.samples = 2
> > > > > > > > metrics.sample.window.ms = 30000
> > > > > > > > partition.assignment.strategy = [class org.apache.kafka.clients.consumer.RangeAssignor]
> > > > > > > > receive.buffer.bytes = 65536
> > > > > > > > reconnect.backoff.ms = 50
> > > > > > > > request.timeout.ms = 305000
> > > > > > > > retry.backoff.ms = 100
> > > > > > > > sasl.kerberos.kinit.cmd = /usr/bin/kinit
> > > > > > > > sasl.kerberos.min.time.before.relogin = 60000
> > > > > > > > sasl.kerberos.service.name = null
> > > > > > > > sasl.kerberos.ticket.renew.jitter = 0.05
> > > > > > > > sasl.kerberos.ticket.renew.window.factor = 0.8
> > > > > > > > sasl.mechanism = GSSAPI
> > > > > > > > security.protocol = SSL
> > > > > > > > send.buffer.bytes = 131072
> > > > > > > > session.timeout.ms = 300000
> > > > > > > > ssl.cipher.suites = null
> > > > > > > > ssl.enabled.protocols = [TLSv1.2, TLSv1.1, TLSv1]
> > > > > > > > ssl.endpoint.identification.algorithm = null
> > > > > > > > ssl.key.password = null
> > > > > > > > ssl.keymanager.algorithm = SunX509
> > > > > > > > ssl.keystore.location = null
> > > > > > > > ssl.keystore.password = null
> > > > > > > > ssl.keystore.type = JKS
> > > > > > > > ssl.protocol = TLS
> > > > > > > > ssl.provider = null
> > > > > > > > ssl.secure.random.implementation = null
> > > > > > > > ssl.trustmanager.algorithm = PKIX
> > > > > > > > ssl.truststore.location = /x/x/client.truststore.jks
> > > > > > > > ssl.truststore.password = [hidden]
> > > > > > > > ssl.truststore.type = JKS
> > > > > > > > value.deserializer = class org.apache.kafka.common.serialization.StringDeserializer
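
Regarding the earlier "How do I check for GC pausing?" question: the usual way is to restart the consumer JVM with GC logging enabled (on Java 8, for example, -Xloggc:gc.log -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintGCApplicationStoppedTime) and look for long stop times around the rebalance timestamps. As a rough in-process alternative, a small monitor thread like the sketch below reports GC time through GarbageCollectorMXBean; the class name and the 10-second sampling interval are just placeholders.

import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

// Periodically prints how much time the JVM spent in GC since the last sample.
// Large jumps here (seconds) suggest GC pauses, not slow record processing,
// are what is delaying the poll loop.
public class GcPauseMonitor implements Runnable {
    @Override
    public void run() {
        long lastTotalGcMs = 0;
        while (!Thread.currentThread().isInterrupted()) {
            long totalGcMs = 0;
            for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
                totalGcMs += gc.getCollectionTime();   // cumulative ms spent in this collector
            }
            System.out.printf("GC time in last 10s window: %d ms%n", totalGcMs - lastTotalGcMs);
            lastTotalGcMs = totalGcMs;
            try {
                Thread.sleep(10_000);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
    }
}

It could be started as a daemon thread alongside each consumer, e.g. new Thread(new GcPauseMonitor(), "gc-monitor").start(), and its output correlated with the "Marking the coordinator ... dead" log entries.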