Todd Lipcon created HDFS-3519:
---------------------------------

             Summary: Checkpoint upload may interfere with a concurrent saveNamespace
                 Key: HDFS-3519
                 URL: https://issues.apache.org/jira/browse/HDFS-3519
             Project: Hadoop HDFS
          Issue Type: Bug
          Components: name-node
    Affects Versions: 2.0.0-alpha, 1.0.3
            Reporter: Todd Lipcon
            Assignee: Todd Lipcon
            Priority: Critical


TestStandbyCheckpoints failed in [precommit build 
2620|https://builds.apache.org/job/PreCommit-HDFS-Build/2620//testReport/] due 
to the following issue:
- Both nodes were in Standby state and configured to checkpoint "as fast as 
possible".
- NN1 starts to save its own namespace.
- NN2 starts to upload a checkpoint for the same txid. Both threads are now 
writing to the same file, fsimage.ckpt_12, but the actual file contents 
correspond to the uploading thread's data.
- NN1 finishes its saveNamespace operation while NN2 is still uploading, so it 
renames the ckpt file. However, the file is still empty, since NN2 hasn't sent 
any bytes yet.
- NN2 finishes the upload, and its rename() call fails, which causes the 
directory to be marked failed, etc.
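
The interleaving above can be reproduced outside HDFS with two writers sharing 
one temporary file. The following is only a minimal sketch, not NameNode code: 
the class name CheckpointRaceSketch and the method simulate() are hypothetical, 
both writers run sequentially in one thread to make the ordering deterministic, 
and it assumes a POSIX-like filesystem where an open file can be renamed.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class CheckpointRaceSketch {

    /**
     * Replays the interleaving from the report with plain files.
     * Returns {size of fsimage_12 right after the first rename,
     *          size of fsimage_12 after the upload completes,
     *          1 if the uploader's rename failed, else 0}.
     */
    static long[] simulate() {
        try {
            Path dir = Files.createTempDirectory("ckpt-race");
            Path tmp = dir.resolve("fsimage.ckpt_12");
            Path fin = dir.resolve("fsimage_12");
            byte[] image = "image-at-txid-12".getBytes(StandardCharsets.UTF_8);

            // "NN1": the local saveNamespace writes its image to the ckpt file.
            try (OutputStream local = Files.newOutputStream(tmp)) {
                local.write(image);
            }

            // "NN2": the upload opens the same path. The default open options
            // truncate the file, so the saved image is replaced by zero bytes.
            OutputStream upload = Files.newOutputStream(tmp);

            // "NN1" finalizes: the rename exposes a file that looks complete.
            Files.move(tmp, fin);
            long sizeAfterFirstRename = Files.size(fin);

            // "NN2" sends its bytes. They land in the already-renamed file,
            // which is the "heals itself" behavior described in the report.
            upload.write(image);
            upload.close();
            long sizeAfterUpload = Files.size(fin);

            // "NN2" tries to finalize, but the ckpt file no longer exists.
            long secondRenameFailed = 0;
            try {
                Files.move(tmp, fin);
            } catch (IOException e) {
                secondRenameFailed = 1;
            }
            return new long[] {sizeAfterFirstRename, sizeAfterUpload, secondRenameFailed};
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        long[] r = simulate();
        System.out.println("fsimage_12 size after first rename: " + r[0]);
        System.out.println("fsimage_12 size after upload:       " + r[1]);
        System.out.println("second rename failed:               " + (r[2] == 1));
    }
}
```

Between the first rename and the end of the upload, fsimage_12 exists under its 
final name but does not yet hold the full image, which is the window in which a 
crash would leave an apparently finalized but incomplete file.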

The result is that there is a file fsimage_12 which appears to be a finalized 
image but in fact is incompletely transferred. When the transfer completes, the 
problem "heals itself" so there wouldn't be persistent corruption unless the 
machine crashes at the same time. And even then, we'd still have the earlier 
checkpoint to restore from.

This same race could occur in a non-HA setup if a user puts the NN in safe mode 
and issues saveNamespace operations concurrent with a 2NN checkpointing, I 
believe.
