Wei-Chiu Chuang created HDFS-11070:
--------------------------------------

             Summary: NPE in BlockSender due to race condition
                 Key: HDFS-11070
                 URL: https://issues.apache.org/jira/browse/HDFS-11070
             Project: Hadoop HDFS
          Issue Type: Bug
          Components: datanode
            Reporter: Wei-Chiu Chuang


Saw the following NPE in a unit test:
{quote}
2016-10-27 14:42:58,450 ERROR DataNode - 127.0.0.1:51987:DataXceiver error processing READ_BLOCK operation  src: /127.0.0.1:52429 dst: /127.0.0.1:51987
java.lang.NullPointerException
        at org.apache.hadoop.hdfs.server.datanode.BlockSender.<init>(BlockSender.java:284)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:581)
        at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opReadBlock(Receiver.java:150)
        at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:102)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:289)
        at java.lang.Thread.run(Thread.java:745)
{quote}

The NPE occurred here:
{code:title=BlockSender.<init>}
      // Obtain a reference before reading data
      this.volumeRef = datanode.data.getVolume(block).obtainReference();
{code}
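The NPE is consistent with {{getVolume(block)}} returning null and the chained {{.obtainReference()}} dereferencing it. A minimal, self-contained sketch of that failure mode (plain Java; {{BlockKey}}, {{Volume}}, and the map below are hypothetical stand-ins, not HDFS internals):

```java
// Illustration only: a lookup keyed on a now-stale genstamp returns null,
// and chaining a call onto the result throws NullPointerException.
import java.util.HashMap;
import java.util.Map;

public class StaleLookupNpe {
    // Hypothetical stand-in for a (blockId, genstamp) lookup key.
    record BlockKey(long blockId, long genstamp) {}

    static class Volume {
        Volume obtainReference() { return this; } // stand-in for taking a reference
    }

    static final Map<BlockKey, Volume> volumeMap = new HashMap<>();

    static Volume lookup(long blockId, long genstamp) {
        return volumeMap.get(new BlockKey(blockId, genstamp));
    }

    public static void main(String[] args) {
        volumeMap.put(new BlockKey(1073741825L, 1192L), new Volume());

        // A concurrent append bumps the genstamp 1192 -> 1193 and re-keys the entry.
        Volume v = volumeMap.remove(new BlockKey(1073741825L, 1192L));
        volumeMap.put(new BlockKey(1073741825L, 1193L), v);

        // The reader still holds the old genstamp: the lookup returns null,
        // and the chained call throws NullPointerException.
        try {
            lookup(1073741825L, 1192L).obtainReference();
        } catch (NullPointerException e) {
            System.out.println("NPE: stale lookup returned null");
        }
    }
}
```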

Right before the NPE there were a few debug messages indicating the replica had just been appended to and updated.
{quote}
2016-10-27 14:42:58,442 DEBUG DataNode - block=BP-1071315328-172.16.1.88-1477604513635:blk_1073741825_1192, replica=FinalizedReplica, blk_1073741825_1192, FINALIZED
  getNumBytes()     = 192
  getBytesOnDisk()  = 192
  getVisibleLength()= 192
  getVolume()       = /Users/weichiu/sandbox/hadoop/hadoop-hdfs-project/hadoop-hdfs-httpfs/target/test-dir/dfs/data/data1
  getBlockURI()     = file:/Users/weichiu/sandbox/hadoop/hadoop-hdfs-project/hadoop-hdfs-httpfs/target/test-dir/dfs/data/data1/current/BP-1071315328-172.16.1.88-1477604513635/current/finalized/subdir0/subdir0/blk_1073741825
2016-10-27 14:42:58,442 INFO  FsDatasetImpl - Appending to FinalizedReplica, blk_1073741825_1192, FINALIZED
  getNumBytes()     = 192
  getBytesOnDisk()  = 192
  getVisibleLength()= 192
  getVolume()       = /Users/weichiu/sandbox/hadoop/hadoop-hdfs-project/hadoop-hdfs-httpfs/target/test-dir/dfs/data/data1
  getBlockURI()     = file:/Users/weichiu/sandbox/hadoop/hadoop-hdfs-project/hadoop-hdfs-httpfs/target/test-dir/dfs/data/data1/current/BP-1071315328-172.16.1.88-1477604513635/current/finalized/subdir0/subdir0/blk_1073741825
2016-10-27 14:42:58,442 DEBUG FsDatasetCache - Block with id 1073741825, pool BP-1071315328-172.16.1.88-1477604513635 does not need to be uncached, because it is not currently in the mappableBlockMap.
2016-10-27 14:42:58,450 DEBUG LocalReplica - Renaming /Users/weichiu/sandbox/hadoop/hadoop-hdfs-project/hadoop-hdfs-httpfs/target/test-dir/dfs/data/data1/current/BP-1071315328-172.16.1.88-1477604513635/current/finalized/subdir0/subdir0/blk_1073741825_1192.meta to /Users/weichiu/sandbox/hadoop/hadoop-hdfs-project/hadoop-hdfs-httpfs/target/test-dir/dfs/data/data1/current/BP-1071315328-172.16.1.88-1477604513635/current/rbw/blk_1073741825_1193.meta
2016-10-27 14:42:58,450 DEBUG LocalReplica - Renaming /Users/weichiu/sandbox/hadoop/hadoop-hdfs-project/hadoop-hdfs-httpfs/target/test-dir/dfs/data/data1/current/BP-1071315328-172.16.1.88-1477604513635/current/finalized/subdir0/subdir0/blk_1073741825 to /Users/weichiu/sandbox/hadoop/hadoop-hdfs-project/hadoop-hdfs-httpfs/target/test-dir/dfs/data/data1/current/BP-1071315328-172.16.1.88-1477604513635/current/rbw/blk_1073741825, file length=192
2016-10-27 14:42:58,450 DEBUG DataNode - writeTo blockfile is /Users/weichiu/sandbox/hadoop/hadoop-hdfs-project/hadoop-hdfs-httpfs/target/test-dir/dfs/data/data1/current/BP-1071315328-172.16.1.88-1477604513635/current/rbw/blk_1073741825 of size 192
2016-10-27 14:42:58,450 DEBUG DataNode - writeTo metafile is /Users/weichiu/sandbox/hadoop/hadoop-hdfs-project/hadoop-hdfs-httpfs/target/test-dir/dfs/data/data1/current/BP-1071315328-172.16.1.88-1477604513635/current/rbw/blk_1073741825_1193.meta of size 11
{quote}

The block object's genstamp should have been the same as that of the on-disk replica. However, the log suggests the replica's genstamp may have been updated after the replica was looked up here:
{code:title=BlockSender.<init>}
      try(AutoCloseableLock lock = datanode.data.acquireDatasetLock()) {
        replica = getReplica(block, datanode);
        replicaVisibleLength = replica.getVisibleLength();
      }
      // if there is a write in progress
      ChunkChecksum chunkChecksum = null;
      if (replica.getState() == ReplicaState.RBW) {
        final ReplicaInPipeline rbw = (ReplicaInPipeline) replica;
        waitForMinLength(rbw, startOffset + length);
        chunkChecksum = rbw.getLastChecksumAndDataLen();
      }
{code}

In summary, I think the assumption here is not valid: the replica's state is checked once, but a write in progress (for example, an append) may begin after that check and change the replica before the volume reference is obtained.
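One way to narrow the window (a sketch only, with plain-Java stand-ins rather than the real {{FsDatasetImpl}} API) is to take the volume reference inside the same critical section as the replica lookup, and to surface a descriptive, retriable error instead of dereferencing a null volume:

```java
// Sketch of a guard against the check-then-act window. The lock, field, and
// method names below are hypothetical stand-ins, not HDFS code.
import java.util.concurrent.locks.ReentrantLock;

public class GuardedVolumeLookup {
    static final ReentrantLock datasetLock = new ReentrantLock();

    // null models "replica momentarily has no volume" while an append re-opens it.
    static volatile String volume = "/data1";

    static String obtainVolumeOrFail() {
        datasetLock.lock();
        try {
            String v = volume;
            if (v == null) {
                // Fail with a retriable condition rather than an NPE.
                throw new IllegalStateException("replica has no volume; retry the read");
            }
            return v;
        } finally {
            datasetLock.unlock();
        }
    }

    public static void main(String[] args) {
        System.out.println(obtainVolumeOrFail());
    }
}
```

With this shape, a concurrent append can still invalidate the volume, but the reader observes a clean, diagnosable exception instead of an NPE deep in {{BlockSender.<init>}}.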



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
