Michael Andre Pearce (IG) created KAFKA-4477:
------------------------------------------------

             Summary: Node reduces its ISR to itself, and doesn't recover. 
Other nodes do not take leadership, cluster remains sick until node is 
restarted.
                 Key: KAFKA-4477
                 URL: https://issues.apache.org/jira/browse/KAFKA-4477
             Project: Kafka
          Issue Type: Bug
          Components: core
    Affects Versions: 0.10.1.0
         Environment: RHEL7

java version "1.8.0_66"
Java(TM) SE Runtime Environment (build 1.8.0_66-b17)
Java HotSpot(TM) 64-Bit Server VM (build 25.66-b17, mixed mode)
            Reporter: Michael Andre Pearce (IG)
            Priority: Critical


We have encountered a critical issue that has recurred in different physical 
environments. We haven't worked out what is going on, but we do have a nasty 
workaround to keep service alive. 

We have not had this issue on clusters still running 0.9.0.1.

We have noticed a node randomly shrinking the ISRs for the partitions it owns 
down to itself. Moments later we see other nodes having disconnects, followed 
finally by application issues, where producing to these partitions is blocked.

It seems that only restarting the Kafka instance's Java process resolves the 
issue.

We have had this occur multiple times, and according to all network and machine 
monitoring the machine never left the network or had any other glitches.

Below are the logs seen during the issue.

Node 7:
[2016-12-01 07:01:28,112] INFO Partition 
[com_ig_trade_v1_position_event--demo--compacted,10] on broker 7: Shrinking ISR 
for partition [com_ig_trade_v1_position_event--demo--compacted,10] from 1,2,7 
to 7 (kafka.cluster.Partition)

All other nodes:
[2016-12-01 07:01:38,172] WARN [ReplicaFetcherThread-0-7], Error in fetch 
kafka.server.ReplicaFetcherThread$FetchRequest@5aae6d42 
(kafka.server.ReplicaFetcherThread)
java.io.IOException: Connection to 7 was disconnected before the response was 
read

All clients:
java.util.concurrent.ExecutionException: 
org.apache.kafka.common.errors.NetworkException: The server disconnected before 
a response was received.


After this occurs, we then suddenly see on the sick machine an increasing 
number of connections in CLOSE_WAIT and of open file descriptors.

As a workaround to keep service alive, we are currently putting in an automated 
process that tails the log and matches the regex below; where new_partitions is 
just the node itself, we restart the node. 

"\[(?P<time>.+)\] INFO Partition \[.*\] on broker .* Shrinking ISR for 
partition \[.*\] from (?P<old_partitions>.+) to (?P<new_partitions>.+) 
\(kafka.cluster.Partition\)"
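
For illustration, the matching step of such a watchdog could look roughly like 
the sketch below. It uses the regex above joined onto one line; the function 
name shrank_to_self and the broker_id parameter are hypothetical and not part 
of the reported setup, and tailing the log and restarting the node are left 
out.

```python
import re

# Regex from the report, joined onto a single line.
ISR_SHRINK = re.compile(
    r"\[(?P<time>.+)\] INFO Partition \[.*\] on broker .* Shrinking ISR for "
    r"partition \[.*\] from (?P<old_partitions>.+) to (?P<new_partitions>.+) "
    r"\(kafka.cluster.Partition\)"
)


def shrank_to_self(line, broker_id):
    """Return True if `line` reports the ISR shrinking to only `broker_id`.

    `broker_id` is an assumed input: the id of the broker whose log is
    being watched.
    """
    match = ISR_SHRINK.search(line)
    if match is None:
        return False  # not an ISR-shrink log line
    # new_partitions holds the new ISR as a comma-separated list of broker ids.
    new_isr = [r.strip() for r in match.group("new_partitions").split(",")]
    return new_isr == [str(broker_id)]
```

A caller would feed each new log line to shrank_to_self and trigger the restart 
when it returns True.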



