Wei-Chiu Chuang created HDFS-11070:
--------------------------------------

             Summary: NPE in BlockSender due to race condition
                 Key: HDFS-11070
                 URL: https://issues.apache.org/jira/browse/HDFS-11070
             Project: Hadoop HDFS
          Issue Type: Bug
          Components: datanode
            Reporter: Wei-Chiu Chuang
Saw the following NPE in a unit test:

{quote}
2016-10-27 14:42:58,450 ERROR DataNode - 127.0.0.1:51987:DataXceiver error processing READ_BLOCK operation src: /127.0.0.1:52429 dst: /127.0.0.1:51987
java.lang.NullPointerException
    at org.apache.hadoop.hdfs.server.datanode.BlockSender.<init>(BlockSender.java:284)
    at org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:581)
    at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opReadBlock(Receiver.java:150)
    at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:102)
    at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:289)
    at java.lang.Thread.run(Thread.java:745)
{quote}

The NPE occurred here:

{code:title=BlockSender.<init>}
// Obtain a reference before reading data
this.volumeRef = datanode.data.getVolume(block).obtainReference();
{code}
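For context on why {{getVolume(block)}} can return null here: the replica map lookup matches on the generation stamp as well as the block ID, so a reader holding an {{ExtendedBlock}} with a stale genstamp gets a miss once a concurrent append bumps the replica's genstamp. A rough sketch of the lookup path, paraphrased from {{FsDatasetImpl}} and {{ReplicaMap}} (not the exact code):

{code:title=Why getVolume() can return null (paraphrased sketch)}
// FsDatasetImpl (paraphrased): a replica-map miss surfaces as a null volume
public FsVolumeImpl getVolume(final ExtendedBlock b) {
  final ReplicaInfo r = volumeMap.get(b.getBlockPoolId(), b.getLocalBlock());
  return r != null ? (FsVolumeImpl) r.getVolume() : null;
}

// ReplicaMap (paraphrased): a genstamp mismatch is treated as a miss
ReplicaInfo get(String bpid, Block block) {
  ReplicaInfo replicaInfo = get(bpid, block.getBlockId());
  if (replicaInfo != null
      && block.getGenerationStamp() == replicaInfo.getGenerationStamp()) {
    return replicaInfo;
  }
  // Stale genstamp in the caller's block -> null -> NPE on obtainReference()
  return null;
}
{code}

So the NPE at BlockSender.java:284 is a symptom of a stale lookup, not of a volume actually disappearing.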
Right before the NPE, a few lines of debug messages indicated the replica was appended to and updated; note that the rename bumps the genstamp from 1192 to 1193:

{quote}
2016-10-27 14:42:58,442 DEBUG DataNode - block=BP-1071315328-172.16.1.88-1477604513635:blk_1073741825_1192, replica=FinalizedReplica, blk_1073741825_1192, FINALIZED
  getNumBytes()     = 192
  getBytesOnDisk()  = 192
  getVisibleLength()= 192
  getVolume()       = /Users/weichiu/sandbox/hadoop/hadoop-hdfs-project/hadoop-hdfs-httpfs/target/test-dir/dfs/data/data1
  getBlockURI()     = file:/Users/weichiu/sandbox/hadoop/hadoop-hdfs-project/hadoop-hdfs-httpfs/target/test-dir/dfs/data/data1/current/BP-1071315328-172.16.1.88-1477604513635/current/finalized/subdir0/subdir0/blk_1073741825
2016-10-27 14:42:58,442 INFO FsDatasetImpl - Appending to FinalizedReplica, blk_1073741825_1192, FINALIZED
  getNumBytes()     = 192
  getBytesOnDisk()  = 192
  getVisibleLength()= 192
  getVolume()       = /Users/weichiu/sandbox/hadoop/hadoop-hdfs-project/hadoop-hdfs-httpfs/target/test-dir/dfs/data/data1
  getBlockURI()     = file:/Users/weichiu/sandbox/hadoop/hadoop-hdfs-project/hadoop-hdfs-httpfs/target/test-dir/dfs/data/data1/current/BP-1071315328-172.16.1.88-1477604513635/current/finalized/subdir0/subdir0/blk_1073741825
2016-10-27 14:42:58,442 DEBUG FsDatasetCache - Block with id 1073741825, pool BP-1071315328-172.16.1.88-1477604513635 does not need to be uncached, because it is not currently in the mappableBlockMap.
2016-10-27 14:42:58,450 DEBUG LocalReplica - Renaming /Users/weichiu/sandbox/hadoop/hadoop-hdfs-project/hadoop-hdfs-httpfs/target/test-dir/dfs/data/data1/current/BP-1071315328-172.16.1.88-1477604513635/current/finalized/subdir0/subdir0/blk_1073741825_1192.meta to /Users/weichiu/sandbox/hadoop/hadoop-hdfs-project/hadoop-hdfs-httpfs/target/test-dir/dfs/data/data1/current/BP-1071315328-172.16.1.88-1477604513635/current/rbw/blk_1073741825_1193.meta
2016-10-27 14:42:58,450 DEBUG LocalReplica - Renaming /Users/weichiu/sandbox/hadoop/hadoop-hdfs-project/hadoop-hdfs-httpfs/target/test-dir/dfs/data/data1/current/BP-1071315328-172.16.1.88-1477604513635/current/finalized/subdir0/subdir0/blk_1073741825 to /Users/weichiu/sandbox/hadoop/hadoop-hdfs-project/hadoop-hdfs-httpfs/target/test-dir/dfs/data/data1/current/BP-1071315328-172.16.1.88-1477604513635/current/rbw/blk_1073741825, file length=192
2016-10-27 14:42:58,450 DEBUG DataNode - writeTo blockfile is /Users/weichiu/sandbox/hadoop/hadoop-hdfs-project/hadoop-hdfs-httpfs/target/test-dir/dfs/data/data1/current/BP-1071315328-172.16.1.88-1477604513635/current/rbw/blk_1073741825 of size 192
2016-10-27 14:42:58,450 DEBUG DataNode - writeTo metafile is /Users/weichiu/sandbox/hadoop/hadoop-hdfs-project/hadoop-hdfs-httpfs/target/test-dir/dfs/data/data1/current/BP-1071315328-172.16.1.88-1477604513635/current/rbw/blk_1073741825_1193.meta of size 11
{quote}

The block object's genstamp should match that of the on-disk replica. However, the log suggests the replica's genstamp may have been updated after the following check:

{code:title=BlockSender.<init>}
try (AutoCloseableLock lock = datanode.data.acquireDatasetLock()) {
  replica = getReplica(block, datanode);
  replicaVisibleLength = replica.getVisibleLength();
}
// if there is a write in progress
ChunkChecksum chunkChecksum = null;
if (replica.getState() == ReplicaState.RBW) {
  final ReplicaInPipeline rbw = (ReplicaInPipeline) replica;
  waitForMinLength(rbw, startOffset + length);
  chunkChecksum = rbw.getLastChecksumAndDataLen();
}
{code}

In summary, I think the assumption here is not valid: an append may begin after that check (once the dataset lock is released), so by the time {{getVolume(block).obtainReference()}} runs, the genstamp in {{block}} is stale and the lookup returns null.
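One possible direction, as a sketch only (the restructuring below is hypothetical, not a tested patch): acquire the volume reference under the same dataset lock as the replica lookup, and turn a replica-map miss into a retriable {{ReplicaNotFoundException}} instead of an NPE:

{code:title=Possible hardening of BlockSender.<init> (hypothetical sketch)}
try (AutoCloseableLock lock = datanode.data.acquireDatasetLock()) {
  replica = getReplica(block, datanode);
  replicaVisibleLength = replica.getVisibleLength();
  // Obtain the volume reference while still holding the dataset lock,
  // so a concurrent append cannot bump the genstamp in between.
  FsVolumeSpi volume = datanode.data.getVolume(block);
  if (volume == null) {
    // The replica was appended to or recovered concurrently; fail with
    // a retriable error rather than an NPE.
    throw new ReplicaNotFoundException(block);
  }
  this.volumeRef = volume.obtainReference();
}
{code}

Whether widening the locked region on the read path is acceptable is a separate question; alternatively, the caller could null-check {{getVolume(block)}} and retry the lookup.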