[
https://issues.apache.org/jira/browse/KAFKA-4477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15984341#comment-15984341
]
Naman Gupta commented on KAFKA-4477:
------------------------------------
Hi all,
I am hitting the same issue on kafka_2.12-0.10.2.0, and I did not see this
bug listed as fixed in the release notes either.
Here are the logs from one of our nodes:
[2017-04-25 08:55:57,434] INFO [ReplicaFetcherManager on broker 72] Added fetcher for partitions List([__consumer_offsets-45, initOffset 0 to broker BrokerEndPoint(73,10.52.208.73,9092)] ) (kafka.server.ReplicaFetcherManager)
[2017-04-25 08:55:57,436] INFO [ReplicaFetcherManager on broker 72] Removed fetcher for partitions DSLAM-1 (kafka.server.ReplicaFetcherManager)
[2017-04-25 08:55:57,437] INFO Truncating log DSLAM-1 to offset 3206058362. (kafka.log.Log)
[2017-04-25 08:55:57,917] WARN [ReplicaFetcherThread-0-74], Error in fetch kafka.server.ReplicaFetcherThread$FetchRequest@2632df26 (kafka.server.ReplicaFetcherThread)
java.io.IOException: Connection to 74 was disconnected before the response was read
    at kafka.utils.NetworkClientBlockingOps$.$anonfun$blockingSendAndReceive$3(NetworkClientBlockingOps.scala:114)
    at kafka.utils.NetworkClientBlockingOps$.$anonfun$blockingSendAndReceive$3$adapted(NetworkClientBlockingOps.scala:112)
    at scala.Option.foreach(Option.scala:257)
    at kafka.utils.NetworkClientBlockingOps$.$anonfun$blockingSendAndReceive$1(NetworkClientBlockingOps.scala:112)
    at kafka.utils.NetworkClientBlockingOps$.recursivePoll$1(NetworkClientBlockingOps.scala:136)
    at kafka.utils.NetworkClientBlockingOps$.pollContinuously$extension(NetworkClientBlockingOps.scala:142)
    at kafka.utils.NetworkClientBlockingOps$.blockingSendAndReceive$extension(NetworkClientBlockingOps.scala:108)
    at kafka.server.ReplicaFetcherThread.sendRequest(ReplicaFetcherThread.scala:249)
    at kafka.server.ReplicaFetcherThread.fetch(ReplicaFetcherThread.scala:234)
    at kafka.server.ReplicaFetcherThread.fetch(ReplicaFetcherThread.scala:42)
    at kafka.server.AbstractFetcherThread.processFetchRequest(AbstractFetcherThread.scala:118)
    at kafka.server.AbstractFetcherThread.doWork(AbstractFetcherThread.scala:103)
    at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:63)
[2017-04-25 08:55:58,026] INFO [ReplicaFetcherManager on broker 72] Added fetcher for partitions List([DSLAM-1, initOffset 3206058362 to broker BrokerEndPoint(73,10.52.208.73,9092)] ) (kafka.server.ReplicaFetcherManager)
> Node reduces its ISR to itself, and doesn't recover. Other nodes do not take
> leadership, cluster remains sick until node is restarted.
> --------------------------------------------------------------------------------------------------------------------------------------
>
> Key: KAFKA-4477
> URL: https://issues.apache.org/jira/browse/KAFKA-4477
> Project: Kafka
> Issue Type: Bug
> Components: core
> Affects Versions: 0.10.1.0
> Environment: RHEL7
> java version "1.8.0_66"
> Java(TM) SE Runtime Environment (build 1.8.0_66-b17)
> Java HotSpot(TM) 64-Bit Server VM (build 25.66-b17, mixed mode)
> Reporter: Michael Andre Pearce (IG)
> Assignee: Apurva Mehta
> Priority: Critical
> Labels: reliability
> Fix For: 0.10.1.1
>
> Attachments: 2016_12_15.zip, issue_node_1001_ext.log,
> issue_node_1001.log, issue_node_1002_ext.log, issue_node_1002.log,
> issue_node_1003_ext.log, issue_node_1003.log, kafka.jstack,
> state_change_controller.tar.gz
>
>
> We have encountered a critical issue that has recurred in different
> physical environments. We haven't worked out what is going on. We do,
> though, have a nasty workaround to keep the service alive.
> We have not had this issue on clusters still running 0.9.01.
> We have noticed a node randomly shrinking the ISRs of the partitions it
> owns down to just itself; moments later we see other nodes having
> disconnects, followed finally by application issues, where producing to
> these partitions is blocked.
> It seems that only restarting the Kafka Java process resolves the issue.
> We have had this occur multiple times, and from all network and machine
> monitoring the machine never left the network or had any other glitches.
> Below are logs from the issue.
> Node 7:
> [2016-12-01 07:01:28,112] INFO Partition [com_ig_trade_v1_position_event--demo--compacted,10] on broker 7: Shrinking ISR for partition [com_ig_trade_v1_position_event--demo--compacted,10] from 1,2,7 to 7 (kafka.cluster.Partition)
> All other nodes:
> [2016-12-01 07:01:38,172] WARN [ReplicaFetcherThread-0-7], Error in fetch kafka.server.ReplicaFetcherThread$FetchRequest@5aae6d42 (kafka.server.ReplicaFetcherThread)
> java.io.IOException: Connection to 7 was disconnected before the response was read
> All clients:
> java.util.concurrent.ExecutionException: org.apache.kafka.common.errors.NetworkException: The server disconnected before a response was received.
> After this occurs, we suddenly see an increasing number of CLOSE_WAIT
> connections and open file descriptors on the sick machine.
> As a workaround to keep the service up, we are currently putting in an
> automated process that tails the broker log and matches the regex below;
> when new_partitions is just the node itself, we restart the node.
> "\[(?P<time>.+)\] INFO Partition \[.*\] on broker .* Shrinking ISR for
> partition \[.*\] from (?P<old_partitions>.+) to (?P<new_partitions>.+)
> \(kafka.cluster.Partition\)"
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)