[ https://issues.apache.org/jira/browse/KAFKA-5154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16015615#comment-16015615 ]
Damian Guy commented on KAFKA-5154:
-----------------------------------

[~Lukas Gemela] Thanks for attaching the full logs.

[~mjsax] [~guozhang] I've been through the logs and extracted the relevant section; it is attached as 5154_error.log for reference.

What appears to be happening is that the heartbeat fails because the group is rebalancing. {{StreamThread.onPartitionsRevoked}} is called and the tasks are cleared. The thread then tries to rejoin the group, and it looks like it is the only member. There are some issues communicating with the cluster. This member becomes the leader for the group and starts doing the partition assignment. Eventually we can see that {{StreamPartitionAssignor}} assigns all partitions to the single client:

{noformat}
Assigned tasks to clients as {8d510649-ddf3-41f1-86c9-c5b3c2c7a1b5=[activeTasks: ([0_0, 0_1, 0_2, 0_3, 0_4, 0_5, 0_6, 0_7, 0_8, 0_9, 0_10, 0_11, 0_12, 0_13, 0_14, 0_15, 0_16, 0_17, 0_18, 0_19, 0_20, 0_21, 0_22, 0_23, 0_24, 0_25, 0_26, 0_27, 0_28, 0_29, 0_30, 0_31, 0_32, 0_33, 0_34, 0_35, 0_36, 0_37, 0_38, 0_39]) assignedTasks: ([0_0, 0_1, 0_2, 0_3, 0_4, 0_5, 0_6, 0_7, 0_8, 0_9, 0_10, 0_11, 0_12, 0_13, 0_14, 0_15, 0_16, 0_17, 0_18, 0_19, 0_20, 0_21, 0_22, 0_23, 0_24, 0_25, 0_26, 0_27, 0_28, 0_29, 0_30, 0_31, 0_32, 0_33, 0_34, 0_35, 0_36, 0_37, 0_38, 0_39])
{noformat}

The SYNC_GROUP request then gets cancelled and the coordinator is marked as dead, but we are still sending fetch requests for the partitions that were previously revoked:

{noformat}
2017-05-07T00:02:03,402 DEBUG StreamThread-1 org.apache.kafka.clients.consumer.internals.Fetcher.sendFetches() @180 - Sending fetch for partitions [poseidonIncidentFeed-12, poseidonIncidentFeed-21, poseidonIncidentFeed-6] to broker 10.210.200.171:9092 (id: 1 rack: null)
2017-05-07T00:02:03,402 DEBUG StreamThread-1 org.apache.kafka.clients.consumer.internals.Fetcher.sendFetches() @180 - Sending fetch for partitions [poseidonIncidentFeed-38] to broker 10.210.200.144:9092 (id: 3 rack: null)
{noformat}

So the {{StreamThread}} is still running with partitions that have already been revoked, hence {{activeTasksByPartition}} is empty and the task lookup for a fetched partition returns null, causing the NPE. Now to try and figure out how this can happen.
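To make the failure mode concrete, here is a small self-contained sketch of the race described above. This is illustrative only: the class, field, and method names below are made up and are not the actual Kafka Streams 0.10.2.0 internals. It just shows how clearing the partition-to-task map in the revoke callback, followed by processing a stale fetch for those same partitions, produces a null lookup and the kind of NPE seen at {{StreamThread.runLoop}} line 619.

{code:java}
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch only, not Kafka code. A rebalance callback clears the
// partition->task map, but the processing loop still handles a fetch response
// for the revoked partitions, so the map lookup returns null and the
// unguarded call on the task throws a NullPointerException.
public class RevokedPartitionNpeSketch {

    // Stand-in for StreamThread's activeTasksByPartition map.
    private static final Map<String, Runnable> activeTasksByPartition = new HashMap<>();

    // Stand-in for StreamThread.onPartitionsRevoked: all tasks are removed.
    private static void onPartitionsRevoked() {
        activeTasksByPartition.clear();
    }

    // Stand-in for the poll/process loop: look up the task for each fetched
    // partition and hand it the records, without a null check.
    private static void processFetchedPartitions(List<String> fetchedPartitions) {
        for (String partition : fetchedPartitions) {
            Runnable task = activeTasksByPartition.get(partition); // null after the revoke
            task.run();                                            // NPE, like StreamThread.runLoop:619
        }
    }

    public static void main(String[] args) {
        activeTasksByPartition.put("poseidonIncidentFeed-12", () -> System.out.println("processing records"));

        onPartitionsRevoked();                                                // rebalance revokes everything
        processFetchedPartitions(Arrays.asList("poseidonIncidentFeed-12"));   // a stale fetch is still delivered
    }
}
{code}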
> Kafka Streams throws NPE during rebalance
> -----------------------------------------
>
>                 Key: KAFKA-5154
>                 URL: https://issues.apache.org/jira/browse/KAFKA-5154
>             Project: Kafka
>          Issue Type: Bug
>          Components: streams
>    Affects Versions: 0.10.2.0
>            Reporter: Lukas Gemela
>            Assignee: Matthias J. Sax
>         Attachments: 5154_problem.log, clio_afa596e9b809.gz, clio_reduced.gz, clio.txt.gz
>
>
> Please see the attached log; Kafka Streams throws a NullPointerException during rebalance, which is caught by our custom exception handler:
> {noformat}
> 2017-04-30T17:44:17,675 INFO kafka-coordinator-heartbeat-thread | hades org.apache.kafka.clients.consumer.internals.AbstractCoordinator.coordinatorDead() @618 - Marking the coordinator 10.210.200.144:9092 (id: 2147483644 rack: null) dead for group hades
> 2017-04-30T17:44:27,395 INFO StreamThread-1 org.apache.kafka.clients.consumer.internals.AbstractCoordinator.onSuccess() @573 - Discovered coordinator 10.210.200.144:9092 (id: 2147483644 rack: null) for group hades.
> 2017-04-30T17:44:27,941 INFO StreamThread-1 org.apache.kafka.clients.consumer.internals.ConsumerCoordinator.onJoinPrepare() @393 - Revoking previously assigned partitions [poseidonIncidentFeed-27, poseidonIncidentFeed-29, poseidonIncidentFeed-30, poseidonIncidentFeed-18] for group hades
> 2017-04-30T17:44:27,947 INFO StreamThread-1 org.apache.kafka.clients.consumer.internals.AbstractCoordinator.sendJoinGroupRequest() @407 - (Re-)joining group hades
> 2017-04-30T17:44:48,468 INFO StreamThread-1 org.apache.kafka.clients.consumer.internals.AbstractCoordinator.sendJoinGroupRequest() @407 - (Re-)joining group hades
> 2017-04-30T17:44:53,628 INFO StreamThread-1 org.apache.kafka.clients.consumer.internals.AbstractCoordinator.sendJoinGroupRequest() @407 - (Re-)joining group hades
> 2017-04-30T17:45:09,587 INFO StreamThread-1 org.apache.kafka.clients.consumer.internals.AbstractCoordinator.sendJoinGroupRequest() @407 - (Re-)joining group hades
> 2017-04-30T17:45:11,961 INFO StreamThread-1 org.apache.kafka.clients.consumer.internals.AbstractCoordinator.onSuccess() @375 - Successfully joined group hades with generation 99
> 2017-04-30T17:45:13,126 INFO StreamThread-1 org.apache.kafka.clients.consumer.internals.ConsumerCoordinator.onJoinComplete() @252 - Setting newly assigned partitions [poseidonIncidentFeed-11, poseidonIncidentFeed-27, poseidonIncidentFeed-25, poseidonIncidentFeed-29, poseidonIncidentFeed-19, poseidonIncidentFeed-18] for group hades
> 2017-04-30T17:46:37,254 INFO kafka-coordinator-heartbeat-thread | hades org.apache.kafka.clients.consumer.internals.AbstractCoordinator.coordinatorDead() @618 - Marking the coordinator 10.210.200.144:9092 (id: 2147483644 rack: null) dead for group hades
> 2017-04-30T18:04:25,993 INFO StreamThread-1 org.apache.kafka.clients.consumer.internals.AbstractCoordinator.onSuccess() @573 - Discovered coordinator 10.210.200.144:9092 (id: 2147483644 rack: null) for group hades.
> 2017-04-30T18:04:29,401 INFO StreamThread-1 org.apache.kafka.clients.consumer.internals.ConsumerCoordinator.onJoinPrepare() @393 - Revoking previously assigned partitions [poseidonIncidentFeed-11, poseidonIncidentFeed-27, poseidonIncidentFeed-25, poseidonIncidentFeed-29, poseidonIncidentFeed-19, poseidonIncidentFeed-18] for group hades
> 2017-04-30T18:05:10,877 INFO StreamThread-1 org.apache.kafka.clients.consumer.internals.AbstractCoordinator.sendJoinGroupRequest() @407 - (Re-)joining group hades
> 2017-05-01T00:01:55,707 INFO StreamThread-1 org.apache.kafka.clients.consumer.internals.AbstractCoordinator.coordinatorDead() @618 - Marking the coordinator 10.210.200.144:9092 (id: 2147483644 rack: null) dead for group hades
> 2017-05-01T00:01:59,027 INFO StreamThread-1 org.apache.kafka.clients.consumer.internals.AbstractCoordinator.onSuccess() @573 - Discovered coordinator 10.210.200.144:9092 (id: 2147483644 rack: null) for group hades.
> 2017-05-01T00:01:59,031 ERROR StreamThread-1 org.apache.kafka.streams.processor.internals.StreamThread.run() @376 - stream-thread [StreamThread-1] Streams application error during processing:
> java.lang.NullPointerException
>         at org.apache.kafka.streams.processor.internals.StreamThread.runLoop(StreamThread.java:619) ~[kafka-streams-0.10.2.0.jar!/:?]
>         at org.apache.kafka.streams.processor.internals.StreamThread.run(StreamThread.java:368) [kafka-streams-0.10.2.0.jar!/:?]
> 2017-05-01T00:02:00,038 INFO StreamThread-1 org.apache.kafka.clients.producer.KafkaProducer.close() @689 - Closing the Kafka producer with timeoutMillis = 9223372036854775807 ms.
> 2017-05-01T00:02:00,949 WARN StreamThread-1 org.apache.kafka.streams.processor.internals.StreamThread.setState() @160 - Unexpected state transition from PARTITIONS_REVOKED to NOT_RUNNING
> 2017-05-01T00:02:00,951 ERROR StreamThread-1 com.williamhill.trading.platform.hades.kafka.KafkaStreamManager.uncaughtException() @104 - UncaughtException in thread StreamThread-1, stopping kafka streams
> java.lang.NullPointerException
>         at org.apache.kafka.streams.processor.internals.StreamThread.runLoop(StreamThread.java:619) ~[kafka-streams-0.10.2.0.jar!/:?]
>         at org.apache.kafka.streams.processor.internals.StreamThread.run(StreamThread.java:368) ~[kafka-streams-0.10.2.0.jar!/:?]
> 2017-05-01T00:02:01,076 WARN kafka-streams-close-thread org.apache.kafka.streams.processor.internals.StreamThread.setState() @160 - Unexpected state transition from NOT_RUNNING to PENDING_SHUTDOWN
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)