[ https://issues.apache.org/jira/browse/KAFKA-4477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15746799#comment-15746799 ]
Apurva Mehta commented on KAFKA-4477:
-------------------------------------

[~tdevoe], thanks for sharing all your extended broker logs, as well as the controller and state change logs. I have a few questions:

# The original description in the ticket states that the problem node reduces the ISR to itself and then doesn't recover. In the logs you shared, the problem node 1002 does shrink its ISRs to itself, but the ISR begins to expand back to the original set only 2 seconds later. The broker log for node 1002 also shows connections from the other replicas coming in; we can tell because the SASL handshake is being logged. The strange bit is that nodes 1001 and 1003, however, can't seem to connect until 2130, which brings me to my next point.
# Did you bounce the hosts at 2130? If not, when were the hosts bounced?
# We have fixed some deadlock bugs where the ISR shrinks to a single node but expands back again. Given the observation in point 1, it may be worth trying the 0.10.1.1 RC to see if you can reproduce this problem when using that code. If it reproduces, then we know for certain that the existing deadlocks are not the issue.
# Another suspicion we have is the changes to the `NetworkClientBlockingOps` code. However, this code does not have any logging. If you try the RC and still hit the issue, would you be willing to deploy a version of 0.10.1 with some instrumentation around the network client code? This would enable us to validate or disprove our hypothesis.

Thanks,
Apurva

> Node reduces its ISR to itself, and doesn't recover. Other nodes do not take
> leadership, cluster remains sick until node is restarted.
> --------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: KAFKA-4477
>                 URL: https://issues.apache.org/jira/browse/KAFKA-4477
>             Project: Kafka
>          Issue Type: Bug
>          Components: core
>    Affects Versions: 0.10.1.0
>         Environment: RHEL7
> java version "1.8.0_66"
> Java(TM) SE Runtime Environment (build 1.8.0_66-b17)
> Java HotSpot(TM) 64-Bit Server VM (build 25.66-b17, mixed mode)
>            Reporter: Michael Andre Pearce (IG)
>            Assignee: Apurva Mehta
>            Priority: Critical
>              Labels: reliability
>         Attachments: issue_node_1001.log, issue_node_1001_ext.log, issue_node_1002.log, issue_node_1002_ext.log, issue_node_1003.log, issue_node_1003_ext.log, kafka.jstack, state_change_controller.tar.gz
>
>
> We have encountered a critical issue that has re-occurred in different physical environments. We haven't worked out what is going on. We do, though, have a nasty workaround to keep the service alive.
> We have not had this issue on clusters still running 0.9.01.
> We have noticed a node randomly shrinking the ISRs for the partitions it owns down to just itself; moments later we see other nodes having disconnects, followed finally by application issues, where producing to these partitions is blocked.
> It seems that only restarting the Kafka Java process resolves the issue.
> We have had this occur multiple times, and from all network and machine monitoring the machine never left the network or had any other glitches.
> Below are logs seen during the issue.
> Node 7:
> [2016-12-01 07:01:28,112] INFO Partition [com_ig_trade_v1_position_event--demo--compacted,10] on broker 7: Shrinking ISR for partition [com_ig_trade_v1_position_event--demo--compacted,10] from 1,2,7 to 7 (kafka.cluster.Partition)
> All other nodes:
> [2016-12-01 07:01:38,172] WARN [ReplicaFetcherThread-0-7], Error in fetch kafka.server.ReplicaFetcherThread$FetchRequest@5aae6d42 (kafka.server.ReplicaFetcherThread)
> java.io.IOException: Connection to 7 was disconnected before the response was read
> All clients:
> java.util.concurrent.ExecutionException: org.apache.kafka.common.errors.NetworkException: The server disconnected before a response was received.
> After this occurs, we then suddenly see on the sick machine an increasing number of CLOSE_WAIT connections and open file descriptors.
> As a workaround to keep the service alive, we are currently putting in an automated process that tails the broker log and matches the regex below; where new_partitions is just the node itself, we restart the node.
> "\[(?P<time>.+)\] INFO Partition \[.*\] on broker .* Shrinking ISR for partition \[.*\] from (?P<old_partitions>.+) to (?P<new_partitions>.+) \(kafka.cluster.Partition\)"
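For reference, here is a minimal sketch of the tail-and-restart watchdog described in the report above. The log path, broker id, and restart command are illustrative assumptions, not part of the original report; the regex is the one quoted in the description. A real deployment would also need to handle log rotation.

{code:python}
#!/usr/bin/env python
# Watchdog sketch: tail the broker log, watch for "Shrinking ISR ... to <this broker>",
# and restart the broker when the ISR shrinks to just this node.
# LOG_FILE, BROKER_ID and RESTART_CMD are assumptions for illustration.
import re
import subprocess
import time

LOG_FILE = "/var/log/kafka/server.log"            # assumed broker log location
BROKER_ID = "7"                                   # assumed id of the local broker
RESTART_CMD = ["systemctl", "restart", "kafka"]   # assumed restart mechanism

# Regex taken from the issue description.
SHRINK_RE = re.compile(
    r"\[(?P<time>.+)\] INFO Partition \[.*\] on broker .* Shrinking ISR for "
    r"partition \[.*\] from (?P<old_partitions>.+) to (?P<new_partitions>.+) "
    r"\(kafka.cluster.Partition\)"
)

def follow(path):
    """Yield lines appended to the file, like `tail -f` (log rotation not handled)."""
    with open(path) as f:
        f.seek(0, 2)  # start at the current end of the file
        while True:
            line = f.readline()
            if not line:
                time.sleep(0.5)
                continue
            yield line

for line in follow(LOG_FILE):
    match = SHRINK_RE.search(line)
    if match and match.group("new_partitions").strip() == BROKER_ID:
        # The ISR shrank to just this broker: restart the Kafka process.
        subprocess.call(RESTART_CMD)
        break
{code}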