We have been trying to narrow down an issue with Kafka 0.10.1 in our
setups where our consumers are frequently marked as dead, causing
rebalances every few seconds. The consumers (Java new API) then start
seeing exceptions like:
org.apache.kafka.clients.consumer.CommitFailedException: Commit cannot
be completed since the group has already rebalanced and assigned the
partitions to another member. This means that the time between
subsequent calls to poll() was longer than the configured
max.poll.interval.ms, which typically implies that the poll loop is
spending too much time message processing. You can address this either
by increasing the session timeout or by reducing the maximum size of
batches returned in poll() with max.poll.records.
    at org.apache.kafka.clients.consumer.internals.ConsumerCoordinator$OffsetCommitResponseHandler.handle(ConsumerCoordinator.java:674) ~[kafka-clients-0.10.1.0.jar!/:na]
    at org.apache.kafka.clients.consumer.internals.ConsumerCoordinator$OffsetCommitResponseHandler.handle(ConsumerCoordinator.java:615) ~[kafka-clients-0.10.1.0.jar!/:na]
    at org.apache.kafka.clients.consumer.internals.AbstractCoordinator$CoordinatorResponseHandler.onSuccess(AbstractCoordinator.java:742) ~[kafka-clients-0.10.1.0.jar!/:na]
    at org.apache.kafka.clients.consumer.internals.AbstractCoordinator$CoordinatorResponseHandler.onSuccess(AbstractCoordinator.java:722) ~[kafka-clients-0.10.1.0.jar!/:na]
    at org.apache.kafka.clients.consumer.internals.RequestFuture$1.onSuccess(RequestFuture.java:186) ~[kafka-clients-0.10.1.0.jar!/:na]
    at org.apache.kafka.clients.consumer.internals.RequestFuture.fireSuccess(RequestFuture.java:149) ~[kafka-clients-0.10.1.0.jar!/:na]
    at org.apache.kafka.clients.consumer.internals.RequestFuture.complete(RequestFuture.java:116) ~[kafka-clients-0.10.1.0.jar!/:na]
    at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient$RequestFutureCompletionHandler.fireCompletion(ConsumerNetworkClient.java:479) ~[kafka-clients-0.10.1.0.jar!/:na]
    at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.firePendingCompletedRequests(ConsumerNetworkClient.java:316) ~[kafka-clients-0.10.1.0.jar!/:na]
    at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:256) ~[kafka-clients-0.10.1.0.jar!/:na]
    at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:180) ~[kafka-clients-0.10.1.0.jar!/:na]
    at org.apache.kafka.clients.consumer.internals.ConsumerCoordinator.commitOffsetsSync(ConsumerCoordinator.java:499) ~[kafka-clients-0.10.1.0.jar!/:na]
Our session and heartbeat timeouts are the defaults that ship with Kafka
0.10.1 (i.e. we don't set any specific values). Every few seconds, we
see messages in the broker logs indicating that these consumers are
considered dead:
[2016-11-02 06:09:48,103] TRACE [GroupCoordinator 0]: Member consumer-1-efde1e11-fdc6-4801-8fba-20d58b7a30b6 in group foo-bar has failed (kafka.coordinator.GroupCoordinator)
[2016-11-02 06:09:48,103] INFO [GroupCoordinator 0]: Preparing to restabilize group foo-bar with old generation 1034 (kafka.coordinator.GroupCoordinator)
[2016-11-02 06:09:48,103] INFO [GroupCoordinator 0]: Group foo-bar with generation 1035 is now empty (kafka.coordinator.GroupCoordinator)
....
These messages keep repeating for almost every other consumer we have
(in different groups).
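Since the exception message suggests either increasing the session timeout or
reducing max.poll.records, for reference here is a quick sketch of the consumer
settings involved, with what I believe are the 0.10.1 defaults (as mentioned,
we don't override any of these; the bootstrap servers and group id below are
just placeholders):

import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;

final Properties props = new Properties();
props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
props.put(ConsumerConfig.GROUP_ID_CONFIG, "foo-bar");
// liveness settings used by the group coordinator
props.put(ConsumerConfig.SESSION_TIMEOUT_MS_CONFIG, "10000");     // default, I believe
props.put(ConsumerConfig.HEARTBEAT_INTERVAL_MS_CONFIG, "3000");   // default, I believe
// knobs the exception message suggests tuning for slow poll loops
props.put(ConsumerConfig.MAX_POLL_INTERVAL_MS_CONFIG, "300000");  // default, I believe
props.put(ConsumerConfig.MAX_POLL_RECORDS_CONFIG, "500");         // default, I believe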
There's no real logic in our consumers: they just pick up a message from
a partition, commit the offset, hand the message off immediately to a
different thread for processing, and go back to polling:
while (!stopped) {
    try {
        final ConsumerRecords<K, V> consumerRecords = consumer.poll(someValue);
        for (final TopicPartition topicPartition : consumerRecords.partitions()) {
            if (stopped) {
                break;
            }
            for (final ConsumerRecord<K, V> consumerRecord : consumerRecords.records(topicPartition)) {
                final long previousOffset = consumerRecord.offset();
                // commit the offset and then pass on the message for processing (in a separate thread)
                consumer.commitSync(Collections.singletonMap(topicPartition,
                        new OffsetAndMetadata(previousOffset + 1)));
                this.executor.execute(new Runnable() {
                    @Override
                    public void run() {
                        // process the ConsumerRecord
                    }
                });
            }
        }
    } catch (Exception e) {
        // log the error and continue
        continue;
    }
}
We haven't been able to figure out why the heartbeats wouldn't be sent
by the consumer within the expected time period. From my understanding
of the docs, the heartbeats are sent by a background thread in the
consumer, so there should be no real reason why they wouldn't be sent.
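To rule out our own loop, one thing we considered is instrumenting the gap
between successive poll() calls, since that gap is what the exception message
blames. This is just a debugging sketch along the lines of our loop above (the
threshold is arbitrary), not code we actually run:

long lastPollAt = System.currentTimeMillis();
while (!stopped) {
    final long now = System.currentTimeMillis();
    final long gapMs = now - lastPollAt;
    if (gapMs > 10000) { // roughly the default 10s session.timeout.ms, purely illustrative
        System.out.println("Gap between poll() calls was " + gapMs + " ms");
    }
    lastPollAt = now;
    final ConsumerRecords<K, V> consumerRecords = consumer.poll(someValue);
    // ... same per-record commitSync and hand-off to the executor as in the loop above ...
}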
We debugged this a bit further and took some thread dumps from the
consumer JVMs; here's what we see:
"*kafka-coordinator-heartbeat-thread* | foo-bar #28 daemon prio=5
os_prio=0 tid=0x00007f0d7c0ee000 nid=0x2e waiting for monitor entry
[0x00007f0dd54c7000]
java.lang.Thread.State: *BLOCKED* (on object monitor)
at
org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.disableWakeups(ConsumerNetworkClient.java:409)
- *waiting to lock <0x00000000c0962bb0>* (a
org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient)
at
org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.pollNoWakeup(ConsumerNetworkClient.java:264)
at
org.apache.kafka.clients.consumer.internals.AbstractCoordinator$HeartbeatThread.run(AbstractCoordinator.java:864)
- locked <0x00000000c0962578> (a
org.apache.kafka.clients.consumer.internals.ConsumerCoordinator)
So it looks like the heartbeat thread is *blocked* waiting for an object
lock, and that lock is held by:
"thread-1" #27 daemon prio=5 os_prio=0 tid=0x00007f0dec3c1800 nid=0x27
runnable [0x00007f0dcdffc000]
java.lang.Thread.State: RUNNABLE
at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:269)
at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:93)
at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:86)
- locked <0x00000000c063b820> (a sun.nio.ch.Util$3)
- locked <0x00000000c063b810> (a java.util.Collections$UnmodifiableSet)
- locked <0x00000000c05f9a70> (a sun.nio.ch.EPollSelectorImpl)
at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:97)
at org.apache.kafka.common.network.Selector.select(Selector.java:470)
at org.apache.kafka.common.network.Selector.poll(Selector.java:286)
at org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:260)
at
org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:232)
- *locked* <*0x00000000c0962bb0*> (a
org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient)
at
org.apache.kafka.clients.consumer.KafkaConsumer.pollOnce(KafkaConsumer.java:1031)
at
org.apache.kafka.clients.consumer.*KafkaConsumer*.*poll*(KafkaConsumer.java:979)
at org.myapp.KafkaMessageReceiver.start(KafkaMessageReceiver.java:72)
So it looks like the consumer code which invokes the
*KafkaConsumer.poll*(...) API to fetch messages is blocking the
heartbeat sender thread? Is this intentional? If so, wouldn't this delay
the heartbeats being sent and cause the heartbeat task on the
coordinator to expire, as per this logic I see in the coordinator:
private def shouldKeepMemberAlive(member: MemberMetadata, heartbeatDeadline: Long) =
  member.awaitingJoinCallback != null ||
    member.awaitingSyncCallback != null ||
    member.latestHeartbeat + member.sessionTimeoutMs > heartbeatDeadline
From what I see, and with my limited understanding of this code, this
would mark the member dead (as seen in the logs).
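As a rough worked example with the default 10 second session timeout (my
numbers, and assuming the member isn't awaiting a join or sync callback), the
check reads to me like this:

// Illustrative numbers only, assuming the 0.10.1 default session timeout.
long latestHeartbeat   = 0;       // last heartbeat the coordinator saw, at t = 0
long sessionTimeoutMs  = 10000;   // 10s default
long heartbeatDeadline = latestHeartbeat + sessionTimeoutMs;  // when the expiration task fires, as far as I can tell
// If no new heartbeat arrived in the meantime, latestHeartbeat is unchanged and:
boolean keepAlive = latestHeartbeat + sessionTimeoutMs > heartbeatDeadline;  // 10000 > 10000 -> false, member marked dead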
Is it expected that the background heartbeat sender thread would be
blocked by poll on the consumer (*our poll timeout is 2 minutes*)? Or
did I misread these logs and stack traces? Let me know if more
logs/details are needed and I can get them.
-Jaikiran