jiangyu created HDFS-9143:
-----------------------------

             Summary: updateCountForQuota method during EditlogTailer loadEdit 
can make SNN timeout very often 
                 Key: HDFS-9143
                 URL: https://issues.apache.org/jira/browse/HDFS-9143
             Project: Hadoop HDFS
          Issue Type: Bug
          Components: namenode
    Affects Versions: 2.6.0, 2.4.0
            Reporter: jiangyu
            Priority: Minor


I have seen many logs from DataNodes in our cluster reporting socket timeouts 
when sending heartbeat or blockReceivedAndDeleted messages to the Standby 
NameNode (SNN), but this never happens with the Active NameNode.  
At first I thought it might be caused by the EditLogTailer fetching too many 
edits and triggering a full GC, but the GC log showed that this is not the 
case. So I investigated the code path and the logs, and found that it takes 
only a few seconds for the SNN to fetch the journal and merge it. However, if 
you open the SNN web page while the merge is in progress, it does not respond, 
much like a stop-the-world pause during a full GC, even though no GC is running 
at that time. I then took jstack dumps of the SNN over a period of time and 
found that all of the time was being spent in the updateCountForQuota method in 
FSImage.  
updateCountForQuota is called every time loadEdits runs. It updates the count 
of each directory with a quota in the namespace, starting from ROOT, and it 
does so while holding the write lock of FSImage. As a result, every time the 
SNN merges edits from the JN, it causes a stop-the-world pause.  
I don't think it is necessary for the SNN to call updateCountForQuota every 
time it tails the edits. When it transitions to Active, it calls 
updateCountForQuota anyway, so no quota data is ever missed.
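
To illustrate the idea, here is a minimal toy sketch in Java. It is not the 
actual Hadoop code: the class QuotaTailSketch, the Dir node type, the 
recountQuota flag, and the transitionToActive hook are all hypothetical names 
made up for this example. It only models the point above: a recount that walks 
the whole tree under the write lock costs time proportional to the namespace 
size on every tail, whereas skipping it while tailing and running it once at 
failover still leaves the counts correct.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.locks.ReentrantReadWriteLock;

/** Toy model only; none of these names are Hadoop's real classes or methods. */
public class QuotaTailSketch {
    /** A directory node with a cached count of the nodes in its subtree. */
    static final class Dir {
        final List<Dir> children = new ArrayList<>();
        long cachedCount; // stale until recount() runs
    }

    final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    final Dir root = new Dir();
    int fullRecounts = 0;

    /** Walks the whole subtree, so the cost grows with the namespace size. */
    long recount(Dir d) {
        long n = 1;
        for (Dir c : d.children) {
            n += recount(c);
        }
        d.cachedCount = n;
        return n;
    }

    /** Applies one batch of edits under the write lock; recounts only if asked. */
    void loadEdits(int newDirs, boolean recountQuota) {
        lock.writeLock().lock();
        try {
            for (int i = 0; i < newDirs; i++) {
                root.children.add(new Dir());
            }
            if (recountQuota) { // the expensive step this report is about
                recount(root);
                fullRecounts++;
            }
        } finally {
            lock.writeLock().unlock();
        }
    }

    /** Hypothetical failover hook: one full recount before serving as active. */
    void transitionToActive() {
        lock.writeLock().lock();
        try {
            recount(root);
            fullRecounts++;
        } finally {
            lock.writeLock().unlock();
        }
    }

    public static void main(String[] args) {
        QuotaTailSketch snn = new QuotaTailSketch();
        for (int batch = 0; batch < 5; batch++) {
            snn.loadEdits(10, false); // standby tailing: skip the recount
        }
        snn.transitionToActive();     // counts are brought up to date here
        System.out.println(snn.root.cachedCount + " nodes, "
                + snn.fullRecounts + " full recount(s)");
    }
}
```

In this sketch the write lock is still taken for each batch of edits, but it is 
held only for the time needed to apply them, not for a full walk of the tree.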



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
