Manoj Govindassamy created HDFS-10819:
-----------------------------------------
Summary: BlockManager fails to store a good block for a datanode
storage after it reported a corrupt block — block replication stuck
Key: HDFS-10819
URL: https://issues.apache.org/jira/browse/HDFS-10819
Project: Hadoop HDFS
Issue Type: Bug
Components: hdfs
Affects Versions: 3.0.0-alpha1
Reporter: Manoj Govindassamy
Assignee: Manoj Govindassamy
TestDataNodeHotSwapVolumes occasionally fails in the unit test
testRemoveVolumeBeingWrittenForDatanode. The data write pipeline can run into
problems such as timeouts or unreachable datanodes; in this test the failure is
induced by removing one of a datanode's volumes while a block write is in
progress. Digging further into the logs shows that when the problem occurs in
the write pipeline, error recovery does not proceed as expected, so block
replication never catches up.
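The timeout below comes from a wait loop in the test: it polls until the file reaches 3 replicas and gives up after a deadline. The helper here is a hypothetical sketch of that pattern (the real test uses HDFS test utilities, not this code); because replication is stuck, the condition never holds and the loop throws TimeoutException.

```java
import java.util.concurrent.TimeoutException;
import java.util.function.BooleanSupplier;

// Hypothetical sketch of a poll-until-true-or-deadline wait loop, the
// pattern behind "Timed out waiting for /test to reach 3 replicas".
public class WaitForReplicas {
    static void waitFor(BooleanSupplier condition, long timeoutMillis)
            throws TimeoutException, InterruptedException {
        long deadline = System.currentTimeMillis() + timeoutMillis;
        while (!condition.getAsBoolean()) {
            if (System.currentTimeMillis() > deadline) {
                throw new TimeoutException(
                        "Timed out waiting for condition to hold");
            }
            Thread.sleep(10); // poll interval
        }
    }

    public static void main(String[] args) throws Exception {
        // Replication stuck at 2 of 3 replicas: the condition never holds.
        final int[] replicas = {2};
        try {
            waitFor(() -> replicas[0] >= 3, 100);
        } catch (TimeoutException e) {
            System.out.println("timed out as expected");
        }
    }
}
```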
{noformat}
Running org.apache.hadoop.hdfs.server.datanode.TestDataNodeHotSwapVolumes
Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 44.495 sec <<< FAILURE! - in org.apache.hadoop.hdfs.server.datanode.TestDataNodeHotSwapVolumes
testRemoveVolumeBeingWritten(org.apache.hadoop.hdfs.server.datanode.TestDataNodeHotSwapVolumes)  Time elapsed: 44.354 sec
java.util.concurrent.TimeoutException: Timed out waiting for /test to reach 3 replicas

Results :

Tests in error:
  TestDataNodeHotSwapVolumes.testRemoveVolumeBeingWritten:637->testRemoveVolumeBeingWrittenForDatanode:714 » Timeout

Tests run: 1, Failures: 0, Errors: 1, Skipped: 0
{noformat}
The following exceptions are not expected in this test run:
{noformat}
2016-08-10 12:30:11,269 [DataXceiver for client DFSClient_NONMAPREDUCE_-640082112_10 at /127.0.0.1:58805 [Receiving block BP-1852988604-172.16.3.66-1470857409044:blk_1073741825_1001]] DEBUG datanode.DataNode (DataXceiver.java:run(320)) - 127.0.0.1:58789:Number of active connections is: 2
java.lang.IllegalMonitorStateException
        at java.lang.Object.wait(Native Method)
        at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeList.waitVolumeRemoved(FsVolumeList.java:280)
        at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.removeVolumes(FsDatasetImpl.java:517)
        at org.apache.hadoop.hdfs.server.datanode.DataNode.removeVolumes(DataNode.java:832)
        at org.apache.hadoop.hdfs.server.datanode.DataNode.removeVolumes(DataNode.java:798)
{noformat}
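The IllegalMonitorStateException above means Object.wait() was invoked by a thread that did not own the object's monitor. A minimal illustration (not Hadoop code) of the wrong and the correct pattern:

```java
// Object.wait() must be called while holding the object's monitor,
// i.e. inside a synchronized block on that object; otherwise the JVM
// throws IllegalMonitorStateException, as seen in
// FsVolumeList.waitVolumeRemoved above.
public class MonitorDemo {
    public static void main(String[] args) throws InterruptedException {
        final Object lock = new Object();

        // Wrong: waiting without owning the monitor.
        try {
            lock.wait(10);
        } catch (IllegalMonitorStateException e) {
            System.out.println("caught IllegalMonitorStateException");
        }

        // Correct: acquire the monitor with synchronized before waiting.
        synchronized (lock) {
            lock.wait(10); // timed wait; returns after 10 ms, no exception
        }
        System.out.println("wait inside synchronized succeeded");
    }
}
```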
{noformat}
2016-08-10 12:30:11,287 [DataNode: [[[DISK]file:/Users/manoj/work/ups-hadoop/hadoop-hdfs-project/hadoop-hdfs/target/test/data/dfs/data/data1/, [DISK]file:/Users/manoj/work/ups-hadoop/hadoop-hdfs-project/hadoop-hdfs/target/test/data/dfs/data/data2/]] heartbeating to localhost/127.0.0.1:58788] ERROR datanode.DataNode (BPServiceActor.java:run(768)) - Exception in BPOfferService for Block pool BP-1852988604-172.16.3.66-1470857409044 (Datanode Uuid 711d58ad-919d-4350-af1e-99fa0b061244) service to localhost/127.0.0.1:58788
java.lang.NullPointerException
        at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.getBlockReports(FsDatasetImpl.java:1841)
        at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.blockReport(BPServiceActor.java:336)
        at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.offerService(BPServiceActor.java:624)
        at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:766)
        at java.lang.Thread.run(Thread.java:745)
{noformat}
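A plausible shape for the NullPointerException above is a race between hot-swap volume removal and block-report generation: a per-volume lookup made from a stale snapshot of the volume list returns null for the removed volume. The sketch below is purely illustrative (the names are hypothetical, not Hadoop's actual fields) and shows the defensive null check that avoids the NPE:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: a volume is removed after a snapshot of the volume
// list is taken, so the per-volume lookup returns null; dereferencing it
// would throw the NullPointerException seen in getBlockReports above.
public class VolumeRaceSketch {
    public static void main(String[] args) {
        Map<String, List<String>> blocksPerVolume = new HashMap<>();
        blocksPerVolume.put("data1", new ArrayList<>());
        blocksPerVolume.put("data2", new ArrayList<>());

        // Snapshot of the volume list taken before the removal.
        List<String> volumes = new ArrayList<>(blocksPerVolume.keySet());

        // A concurrent hot-swap removes data2 after the snapshot was taken.
        blocksPerVolume.remove("data2");

        for (String v : volumes) {
            List<String> blocks = blocksPerVolume.get(v); // null for data2
            if (blocks == null) {
                // Defensive check: skip the concurrently removed volume
                // instead of dereferencing null.
                System.out.println("skipping removed volume " + v);
                continue;
            }
            System.out.println("reporting " + blocks.size()
                    + " blocks on " + v);
        }
    }
}
```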
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)