Hi guys,

Situation:
  3 nodes, each 32G memory, CPU 24 cores, 1T hd.
  3 brokers on 3 nodes, and 3 zookeeper on these 3 nodes too, all the 
properties are default, start the zookeeper cluster and kafka cluster.
  Create a topic (3 replications, 6 partions), like below:
    bin/kafka-topics.sh --create --zookeeper hw168:2181 --replication-factor 3 
--partitions 6 --topic test
  And run the ProducerPerformance given by kafka on the two nodes at the same 
time, it means we have two producers, command like below:
    bin/kafka-run-class.sh org.apache.kafka.tools.ProducerPerformance --topic 
test --num-records 100000000 --record-size 100 --throughput -1 --producer-props 
bootstrap.servers=hw168:9092 buffer.memory=67108864 batch.size=65536 acks=1

Problem:
  We can see from the producer, a lot of  NotLeaderForPartitionException:
    org.apache.kafka.common.errors.NotLeaderForPartitionException: This server 
is not the leader for that topic-partition.
    org.apache.kafka.common.errors.NotLeaderForPartitionException: This server 
is not the leader for that topic-partition.
    org.apache.kafka.common.errors.NotLeaderForPartitionException: This server 
is not the leader for that topic-partition.
    …

Track the process (by using DEBUG):
  There is a INFO: 
    INFO Client session timed out, have not heard from server in 11647ms for 
sessionid 0x258de4a26a40000, closing socket connection and attempting reconnect 
(org.apache.zookeeper.ClientCnxn)

  And We found that the connection between zkClient (kafka holds) and zookeeper 
server is closed by zookeeper server, the reason is that time is out, for 
details:
    [2016-12-08 20:24:00,547] DEBUG Partition [test,5] on broker 1: Skipping 
update high watermark since Old hw 15986847 [8012779 : 1068525112] is larger 
than new hw 15986847 [8012779 : 1068525112] for partition [test,5]. All leo's 
are 16566175 [16025299 : 72477384],15986847 [8012779 : 1068525112],16103549 
[16025299 : 10485500] (kafka.cluster.Partition)
    [2016-12-08 20:24:00,547] DEBUG Adding index entry 16566161 => 72475508 to 
00000000000016025299.index. (kafka.log.OffsetIndex)
    [2016-12-08 20:24:11,368] DEBUG [Replica Manager on Broker 1]: Request key 
test-2 unblocked 0 fetch requests. (kafka.server.ReplicaManager)
    [2016-12-08 20:24:11,368] DEBUG Partition [test,2] on broker 1: Skipping 
update high watermark since Old hw 16064424 [16025299 : 5242750] is larger than 
new hw 16064424 [16025299 : 5242750] for partition [test,2]. All leo's are 
16566175 [16025299 : 72477384],16205274 [16025299 : 24116650],16064424 
[16025299 : 5242750] (kafka.cluster.Partition)
    [2016-12-08 20:24:11,369] DEBUG [Replica Manager on Broker 1]: Produce to 
local log in 10821 ms (kafka.server.ReplicaManager)
    [2016-12-08 20:24:11,369] INFO Client session timed out, have not heard 
from server in 11647ms for sessionid 0x258de4a26a40000, closing socket 
connection and attempting reconnect (org.apache.zookeeper.ClientCnxn)

  Please watch the time, the there is no DEBUG between 20:24:00,547 and 
20:24:11,368, it already exceeded the time for session timeout (6000ms), so it 
causes this disconnection.  We keep digging:
  We found that it got stuck in the function:
      selector.select(waitTimeOut);     — in the method doTransport(…) in class 
org.apache.zookeeper.ClientCnxnSocketNIO
  and that is the time ww got no DEBUG.
  For more details, Call procedure (zookeeper client):
      org.apache.zookeeper.ClientCnxn -> run() -> doTransport(..)
  In the function run(), every time it will check whether there is a timeout, 
if not, it will run doTransport, but the doTransport costs about 10s, so next 
loop, it will find the timeout.

  Keep going, I thought there could be a deadlock at that time, so I keep 
printing the jstack of the kafka and zookeeper. Using the shell like below:
      while true; do echo -e "\n\n"`date "+%Y-%m-%d %H:%M:%S,%N"`"\n"`jstack 
12165` >> zkjstack; echo -e "\n\n"`date "+%Y-%m-%d %H:%M:%S,%N"`"\n"`jstack 
12425` >> kafkajstack; done
  And I check the period we got NO DEBUG in the stack logs, surprise, there 
also NO LOG at that time!!! Why?

So I’m confused that why it got stuck in that function? Why there is no DEBUG 
or LOG in that weird period? Please help me. Thank you all.
  

Xiaoyuan

Reply via email to