zhouyingchao created HDFS-8496:
----------------------------------

             Summary: Calling stopWriter() with FSDatasetImpl lock held may block other threads
                 Key: HDFS-8496
                 URL: https://issues.apache.org/jira/browse/HDFS-8496
             Project: Hadoop HDFS
          Issue Type: Bug
    Affects Versions: 2.6.0
            Reporter: zhouyingchao
            Assignee: zhouyingchao


On a DN of an HDFS 2.6 cluster, we noticed that some DataXceiver threads and heartbeat threads were blocked for quite a while on the FSDatasetImpl lock. Looking at the stacks, we found that a call to stopWriter() made with the FSDatasetImpl lock held was blocking everything else.

The following heartbeat thread's stack shows, as an example, how threads are blocked on the FSDatasetImpl lock:
{code}
   java.lang.Thread.State: BLOCKED (on object monitor)
        at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeImpl.getDfsUsed(FsVolumeImpl.java:152)
        - waiting to lock <0x00000007701badc0> (a org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl)
        at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeImpl.getAvailable(FsVolumeImpl.java:191)
        at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.getStorageReports(FsDatasetImpl.java:144)
        - locked <0x0000000770465dc0> (a java.lang.Object)
        at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.sendHeartBeat(BPServiceActor.java:575)
        at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.offerService(BPServiceActor.java:680)
        at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:850)
        at java.lang.Thread.run(Thread.java:662)
{code}

The thread that held the FSDatasetImpl lock was itself just sleeping in stopWriter(), waiting for another thread to exit. Its stack is:
{code}
   java.lang.Thread.State: TIMED_WAITING (on object monitor)
        at java.lang.Object.wait(Native Method)
        at java.lang.Thread.join(Thread.java:1194)
        - locked <0x00000007636953b8> (a org.apache.hadoop.util.Daemon)
        at org.apache.hadoop.hdfs.server.datanode.ReplicaInPipeline.stopWriter(ReplicaInPipeline.java:183)
        at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.recoverCheck(FsDatasetImpl.java:982)
        at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.recoverClose(FsDatasetImpl.java:1026)
        - locked <0x00000007701badc0> (a org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:624)
        at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:137)
        at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:74)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:235)
        at java.lang.Thread.run(Thread.java:662)
{code}
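
To make the pathology concrete, here is a minimal self-contained Java sketch (all names are hypothetical stand-ins, not the actual HDFS classes): because recover() joins the writer thread while holding the object monitor, even a cheap synchronized getter like getDfsUsed() blocks for as long as the join takes.
{code}
// Minimal sketch (hypothetical names) of the pathology above: joining a
// thread while holding a monitor blocks every other thread that needs
// the same monitor, no matter how cheap its own call is.
public class DatasetBlockingSketch {
    // Stands in for recoverClose(): joins the writer thread while
    // holding the dataset monitor, as in the second stack above.
    public synchronized void recover(Thread writer) throws InterruptedException {
        writer.join();  // may take a long time if the writer is stuck on slow I/O
    }

    // Stands in for getDfsUsed() on the heartbeat path: a trivial getter
    // that nevertheless blocks behind recover() on the same monitor.
    public synchronized long getDfsUsed() {
        return 0L;
    }
}
{code}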

In this case we had deployed quite a lot of other workloads on the DN, so the local file system and disks were quite busy; we suspect this is why stopWriter() took such a long time.
In any case, it is not reasonable to call stopWriter() with the FSDatasetImpl lock held. In HDFS-7999, createTemporary() was changed to call stopWriter() without the FSDatasetImpl lock. We think we should do the same in the other three methods, recoverClose(), recoverAppend() and recoverRbw(), as sketched below.
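
A rough sketch of that pattern, with hypothetical stand-in names rather than the actual FsDatasetImpl code: stop the writer with the lock released, then re-take the lock and re-check the state before completing the recovery, looping in case a new writer attached in between.
{code}
// Sketch of the HDFS-7999 pattern (hypothetical names, not the actual
// patch): join the writer outside the monitor, then re-take the lock
// and re-validate before finishing the recovery.
public class LocklessStopWriterSketch {
    private Thread writer;  // stand-in for the ReplicaInPipeline writer thread

    public void recover() throws InterruptedException {
        while (true) {
            Thread w;
            synchronized (this) {
                // Re-validate state under the lock, as recoverCheck() would.
                w = writer;
                if (w == null || !w.isAlive()) {
                    // No active writer: safe to complete recovery while locked.
                    return;
                }
            }
            // The join happens here with no dataset lock held, so heartbeats
            // and other DataXceivers calling synchronized methods on this
            // object are no longer blocked behind it.
            w.interrupt();
            w.join();
            // Loop and re-check: a new writer may have attached meanwhile.
        }
    }
}
{code}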

I'll try to finish a patch for this today. 



