Manoj Govindassamy created HDFS-10819:
-----------------------------------------
             Summary: BlockManager fails to store a good block for a datanode storage after it reported a corrupt block - block replication stuck
                 Key: HDFS-10819
                 URL: https://issues.apache.org/jira/browse/HDFS-10819
             Project: Hadoop HDFS
          Issue Type: Bug
          Components: hdfs
    Affects Versions: 3.0.0-alpha1
            Reporter: Manoj Govindassamy
            Assignee: Manoj Govindassamy

TestDataNodeHotSwapVolumes occasionally fails in the unit test testRemoveVolumeBeingWrittenForDatanode. A data write pipeline can run into problems such as timeouts or an unreachable datanode; in this test the failure is induced by removing one of a datanode's volumes while a block write is in progress. Digging further into the logs, when the problem happens in the write pipeline, error recovery does not proceed as expected, so block replication never catches up.

{noformat}
Running org.apache.hadoop.hdfs.server.datanode.TestDataNodeHotSwapVolumes
Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 44.495 sec <<< FAILURE! - in org.apache.hadoop.hdfs.serv
testRemoveVolumeBeingWritten(org.apache.hadoop.hdfs.server.datanode.TestDataNodeHotSwapVolumes)  Time elapsed: 44.354 se
java.util.concurrent.TimeoutException: Timed out waiting for /test to reach 3 replicas

Results :

Tests in error:
  TestDataNodeHotSwapVolumes.testRemoveVolumeBeingWritten:637->testRemoveVolumeBeingWrittenForDatanode:714 » Timeout

Tests run: 1, Failures: 0, Errors: 1, Skipped: 0
{noformat}

The following exceptions are not expected in this test run:

{noformat}
614 2016-08-10 12:30:11,269 [DataXceiver for client DFSClient_NONMAPREDUCE_-640082112_10 at /127.0.0.1:58805 [Receiving block BP-1852988604-172.16.3.66-1470857409044:blk_1073741825_1001]] DEBUG datanode.DataNode (DataXceiver.java:run(320)) - 127.0.0.1:58789:Number of active connections is: 2
615 java.lang.IllegalMonitorStateException
616         at java.lang.Object.wait(Native Method)
617         at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeList.waitVolumeRemoved(FsVolumeList.java:280)
618         at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.removeVolumes(FsDatasetImpl.java:517)
619         at org.apache.hadoop.hdfs.server.datanode.DataNode.removeVolumes(DataNode.java:832)
620         at org.apache.hadoop.hdfs.server.datanode.DataNode.removeVolumes(DataNode.java:798)
{noformat}

{noformat}
720 2016-08-10 12:30:11,287 [DataNode: [[[DISK]file:/Users/manoj/work/ups-hadoop/hadoop-hdfs-project/hadoop-hdfs/target/test/data/dfs/data/data1/, [DISK]file:/Users/manoj/work/ups-hadoop/hadoop-hdfs-project/hadoop-hdfs/target/test/data/dfs/data/data2/]] heartbeating to localhost/127.0.0.1:58788] ERROR datanode.DataNode (BPServiceActor.java:run(768)) - Exception in BPOfferService for Block pool BP-1852988604-172.16.3.66-1470857409044 (Datanode Uuid 711d58ad-919d-4350-af1e-99fa0b061244) service to localhost/127.0.0.1:58788
721 java.lang.NullPointerException
722         at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.getBlockReports(FsDatasetImpl.java:1841)
723         at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.blockReport(BPServiceActor.java:336)
724         at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.offerService(BPServiceActor.java:624)
725         at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:766)
726         at java.lang.Thread.run(Thread.java:745)
{noformat}
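For readers not familiar with the test, the failing scenario can be outlined roughly as below. This is only a minimal sketch, not the actual TestDataNodeHotSwapVolumes code: the class name RemoveVolumeDuringWriteRepro is hypothetical, and the volume-removal step that the real test performs by reconfiguring dfs.datanode.data.dir on a live DataNode is only indicated by a comment. The final timeout message matches what DFSTestUtil.waitReplication reports when the block never reaches the expected replica count.

{noformat}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.DFSTestUtil;
import org.apache.hadoop.hdfs.DistributedFileSystem;
import org.apache.hadoop.hdfs.HdfsConfiguration;
import org.apache.hadoop.hdfs.MiniDFSCluster;

/** Rough, hypothetical outline of the failing scenario (not the real test). */
public class RemoveVolumeDuringWriteRepro {
  public static void main(String[] args) throws Exception {
    Configuration conf = new HdfsConfiguration();
    MiniDFSCluster cluster = new MiniDFSCluster.Builder(conf)
        .numDataNodes(3)          // write pipeline of three datanodes
        .build();
    try {
      cluster.waitActive();
      DistributedFileSystem fs = cluster.getFileSystem();

      Path file = new Path("/test");
      FSDataOutputStream out = fs.create(file, (short) 3);
      out.write(new byte[1024]);
      out.hflush();               // block write is now in progress on all three datanodes

      // The real test hot-swaps volumes at this point: it reconfigures
      // dfs.datanode.data.dir on one datanode so that the volume holding the
      // in-progress replica is removed, which triggers the pipeline error and
      // the recovery path under investigation.

      out.close();

      // Fails with "Timed out waiting for /test to reach 3 replicas" when the
      // NameNode never re-replicates the block onto a good storage.
      DFSTestUtil.waitReplication(fs, file, (short) 3);
    } finally {
      cluster.shutdown();
    }
  }
}
{noformat}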
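On the first unexpected exception: java.lang.Object.wait() throws IllegalMonitorStateException whenever the calling thread does not hold the monitor of the object it is waiting on, which is what the trace through FsVolumeList.waitVolumeRemoved suggests is happening. A minimal standalone illustration of that Java rule (not the FsVolumeList code itself):

{noformat}
public class MonitorStateDemo {
  private final Object lock = new Object();

  // Throws IllegalMonitorStateException: wait() is invoked without holding
  // lock's monitor, i.e. outside any synchronized (lock) block.
  void waitWithoutMonitor() throws InterruptedException {
    lock.wait(1000);
  }

  // Correct pattern: wait() inside synchronized (lock), in a loop that
  // re-checks the condition after every wake-up.
  void waitWithMonitor() throws InterruptedException {
    synchronized (lock) {
      while (!done()) {
        lock.wait(1000);
      }
    }
  }

  private boolean done() {
    return false; // placeholder condition for the sketch
  }
}
{noformat}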