[ https://issues.apache.org/jira/browse/HDFS-9143?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Walter Su resolved HDFS-9143.
-----------------------------
    Resolution: Duplicate

> updateCountForQuota method during EditlogTailer loadEdit can make SNN timeout
> very often
> -----------------------------------------------------------------------------------------
>
>                 Key: HDFS-9143
>                 URL: https://issues.apache.org/jira/browse/HDFS-9143
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: namenode
>    Affects Versions: 2.4.0, 2.6.0
>            Reporter: jiangyu
>            Priority: Minor
>
> I have seen many logs from DataNodes in our cluster reporting socket timeouts
> when sending heartbeats or blockReceivedAndDeleted to the Standby NameNode
> (SNN), but this never happens with the Active NameNode.
> At first, I thought it might be caused by the EditLogTailer fetching too many
> edits and triggering a full GC, but the GC log showed that this is not the
> case. So I investigated the code path and the logs, and found that it takes
> only a few seconds for the SNN to fetch the journal and merge it. However, if
> you open the SNN web page while the merge is in progress, it does not respond,
> much like a stop-the-world pause during a full GC, although no GC is running
> at that time. So I ran jstack against the SNN for a while and found that all
> of the time is spent in the updateCountForQuota method in FSImage.
> updateCountForQuota is called every time loadEdits runs. It updates the count
> of every directory with a quota in the namespace, starting from ROOT. It also
> holds the write lock of FSImage, so every time the SNN merges edits from the
> JN, it effectively causes a stop-the-world pause.
> I don't think it is necessary for the SNN to call updateCountForQuota every
> time it tails the edits. When it transitions to Active, it calls
> updateCountForQuota, so no quota data is ever missed.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
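
The behaviour described in the report can be illustrated with a toy model. The sketch below is not Hadoop code: the class QuotaRecountSketch and the names ToyNameNode, Dir, tailEdits, and run are hypothetical and exist only for this illustration. It assumes that each tailing round takes a single write lock and, in the "eager" variant, walks the whole tree from the root before releasing it, while the "lazy" variant walks the tree only once, on the transition to active.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class QuotaRecountSketch {

    /** Toy directory node (hypothetical; not Hadoop's INodeDirectory). */
    static final class Dir {
        final List<Dir> children = new ArrayList<>();
        int files;         // files directly inside this directory
        long cachedCount;  // cached subtree file count, refreshed by recount()
    }

    /** Toy standby NameNode (hypothetical; not Hadoop's FSImage or FSNamesystem). */
    static final class ToyNameNode {
        final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
        final Dir root = new Dir();
        final boolean recountOnEveryTail;
        int fullRecounts = 0;

        ToyNameNode(boolean recountOnEveryTail) {
            this.recountOnEveryTail = recountOnEveryTail;
        }

        /** Applies one batch of edits; readers are blocked while the write lock is held. */
        void tailEdits(int newFiles) {
            lock.writeLock().lock();
            try {
                root.files += newFiles;
                if (recountOnEveryTail) {
                    recountLocked(); // full tree walk under the write lock
                }
            } finally {
                lock.writeLock().unlock();
            }
        }

        /** Recounts once when the node becomes active, so cached counts are current. */
        void transitionToActive() {
            lock.writeLock().lock();
            try {
                recountLocked();
            } finally {
                lock.writeLock().unlock();
            }
        }

        private void recountLocked() {
            fullRecounts++;
            recount(root);
        }

        /** Recursively refreshes the cached subtree count of every directory. */
        private static long recount(Dir d) {
            long total = d.files;
            for (Dir child : d.children) {
                total += recount(child);
            }
            d.cachedCount = total;
            return total;
        }
    }

    /** Returns {final cached count at root, number of full tree walks}. */
    static long[] run(boolean recountOnEveryTail, int batches) {
        ToyNameNode nn = new ToyNameNode(recountOnEveryTail);
        for (int i = 0; i < batches; i++) {
            nn.tailEdits(10);
        }
        nn.transitionToActive();
        return new long[] { nn.root.cachedCount, nn.fullRecounts };
    }

    public static void main(String[] args) {
        long[] eager = run(true, 5);
        long[] lazy = run(false, 5);
        System.out.println("eager: count=" + eager[0] + " fullRecounts=" + eager[1]);
        System.out.println("lazy:  count=" + lazy[0] + " fullRecounts=" + lazy[1]);
    }
}
```

In this model both variants end with the same cached count once the node is active; they differ only in how many full tree walks happen while the write lock is held. Whether that carries over to a real NameNode depends on details the toy model leaves out.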