Running multiple 2NNs can result in corrupt file system
-------------------------------------------------------

                 Key: HDFS-2305
                 URL: https://issues.apache.org/jira/browse/HDFS-2305
             Project: Hadoop HDFS
          Issue Type: Bug
          Components: name-node
    Affects Versions: 0.20.2
            Reporter: Aaron T. Myers
            Assignee: Aaron T. Myers


Here's the scenario:

* You run the NN and 2NN (2NN A) on the same machine.
* You don't have the address of the 2NN configured, so it's defaulting to 
127.0.0.1.
* There's another 2NN (2NN B) running on a second machine.
* When a 2NN is done checkpointing, it says "hey NN, I have an updated fsimage 
for you. You can download it from this URL, which includes my IP address, which 
is x"
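
To make the setup concrete, here is a minimal sketch of the address problem. It is not Hadoop code: `advertised_address`, `LOOPBACK`, `ILLUSTRATIVE_PORT` and the example hostname are hypothetical names made up for illustration. The only thing taken from the scenario is that an unconfigured 2NN falls back to 127.0.0.1.

```python
from typing import Optional, Tuple

LOOPBACK = "127.0.0.1"      # fallback host from the scenario above
ILLUSTRATIVE_PORT = 50090   # placeholder port, an assumption for this sketch


def advertised_address(configured_host: Optional[str] = None,
                       port: int = ILLUSTRATIVE_PORT) -> Tuple[str, int]:
    """Return the (host, port) a 2NN would tell the NN to download from."""
    return (configured_host or LOOPBACK, port)


# 2NN A (on the NN machine) and 2NN B (on a second machine), both unconfigured.
addr_a = advertised_address()
addr_b = advertised_address()

# Both advertise the same loopback address, so from the NN's side the two
# 2NNs are indistinguishable, and 127.0.0.1 always means "the NN machine".
print(addr_a == addr_b)

# With an explicit, routable address for 2NN B the ambiguity goes away.
print(advertised_address("2nn-b.example.com") != addr_a)
```

The point of the sketch is only that the advertised address is interpreted by the NN, not by the 2NN that sent it.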

And here are the steps that occur to cause this issue:

# Some edits happen.
# 2NN A (on the NN machine) does a checkpoint. All is dandy.
# Some more edits happen.
# 2NN B (on a different machine) does a checkpoint. It tells the NN "grab the 
newly-merged fsimage file from 127.0.0.1"
# NN happily grabs the fsimage from 2NN A (the 2NN on the NN machine), which is 
stale.
# NN renames the edits.new file to edits. At this point the in-memory FS state is 
fine, but the on-disk state is missing the edits from step 3.
# The next time a 2NN (any 2NN) tries to do a checkpoint, it gets an up-to-date 
edits file, with an outdated fsimage, and tries to apply those edits to that 
fsimage.
# Kaboom.
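
The sequence above can be replayed with a toy model. This is a rough sketch under simplifying assumptions, not the real NN/2NN implementation: the fsimage and edit logs are modeled as plain lists of edit names, and the classes `NameNode` and `SecondaryNameNode`, their methods, and `simulate` are all hypothetical. The `nn_view` dict stands for how the NN resolves an advertised host, so 127.0.0.1 always lands on 2NN A.

```python
class NameNode:
    def __init__(self):
        self.memory = []       # in-memory FS state: every edit applied so far
        self.fsimage = []      # on-disk image (edits already merged in)
        self.edits = []        # on-disk edit log
        self.edits_new = None  # edit log used while a checkpoint is running

    def apply(self, edit):
        self.memory.append(edit)
        log = self.edits_new if self.edits_new is not None else self.edits
        log.append(edit)

    def roll_edits(self):
        """Start writing to edits.new; hand out the current image and edits."""
        self.edits_new = []
        return list(self.fsimage), list(self.edits)

    def finish_checkpoint(self, fetched_image):
        """Install the fetched image and rename edits.new to edits."""
        self.fsimage = list(fetched_image)
        self.edits = self.edits_new
        self.edits_new = None

    def on_disk(self):
        return self.fsimage + self.edits


class SecondaryNameNode:
    def __init__(self, advertised_host):
        self.advertised_host = advertised_host
        self.merged = []  # last merged fsimage kept by this 2NN

    def checkpoint(self, nn, nn_view):
        image, edits = nn.roll_edits()
        self.merged = image + edits
        # The NN downloads from whoever answers at the advertised host,
        # which is not necessarily the 2NN that just finished merging.
        server = nn_view[self.advertised_host]
        nn.finish_checkpoint(server.merged)


def simulate():
    nn = NameNode()
    snn_a = SecondaryNameNode("127.0.0.1")  # on the NN machine
    snn_b = SecondaryNameNode("127.0.0.1")  # on a second machine, unconfigured
    nn_view = {"127.0.0.1": snn_a}          # loopback on the NN machine is 2NN A

    nn.apply("edit-1")               # some edits happen
    snn_a.checkpoint(nn, nn_view)    # 2NN A checkpoints, all is dandy
    nn.apply("edit-2")               # some more edits happen
    snn_b.checkpoint(nn, nn_view)    # 2NN B checkpoints, NN fetches from 2NN A
    return nn.memory, nn.on_disk()


memory, disk = simulate()
print("in memory:", memory)
print("on disk:  ", disk)
```

In this model the in-memory state still holds both edits after 2NN B's checkpoint, while the on-disk image plus edits no longer contain the second one. Any later checkpoint would then merge newer edits into an image that never saw it.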

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
