fsck move should be non-destructive by default
----------------------------------------------

                 Key: HDFS-3044
                 URL: https://issues.apache.org/jira/browse/HDFS-3044
             Project: Hadoop HDFS
          Issue Type: Improvement
          Components: name-node
            Reporter: Eli Collins
            Assignee: Colin Patrick McCabe


The fsck move behavior, as implemented in the code and originally articulated 
in HADOOP-101, is:

{quote}Current failure modes for DFS involve blocks that are completely 
missing. The only way to "fix" them would be to recover chains of blocks and 
put them into lost+found{quote}

A directory is created with the file's name, the blocks that are accessible 
are created as individual files in this directory, and then the original file 
is removed.

I suspect the rationale for this behavior was that you can't use files that are 
missing locations, and copying the blocks as files at least makes part of the 
files accessible. However, this behavior can also result in permanent data 
loss. E.g.:
- Some datanodes don't come up and check in on cluster startup (e.g. due to 
hardware issues), so files with blocks whose replicas are all on this set of 
datanodes are marked corrupt
- An admin runs fsck move, which deletes the "corrupt" files and saves whatever 
blocks were available
- The hardware issues with the datanodes are resolved, and they are started and 
join the cluster. The NN tells them to delete their blocks for the corrupt 
files since the files were deleted.
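
The sequence above can be walked through with a toy Python model. This is only 
a sketch, not HDFS code: ToyCluster, its methods, and the paths are all 
hypothetical, and unlike real fsck the model reuses block ids for the salvaged 
files instead of copying block data.

```python
# Toy model only: nothing here is an HDFS API.

class ToyCluster:
    def __init__(self):
        self.files = {}          # path -> list of block ids
        self.live_replicas = {}  # block id -> datanodes that have reported it
        self.on_disk = {}        # datanode -> block ids physically stored

    def add_file(self, path, placement):
        """placement maps each block id to the datanodes holding a replica."""
        self.files[path] = list(placement)
        for block, nodes in placement.items():
            self.live_replicas[block] = set()
            for node in nodes:
                self.on_disk.setdefault(node, set()).add(block)

    def check_in(self, node):
        """A datanode reports its blocks; unreferenced blocks get deleted."""
        referenced = {b for blocks in self.files.values() for b in blocks}
        for block in list(self.on_disk.get(node, ())):
            if block in referenced:
                self.live_replicas[block].add(node)
            else:
                # The NN tells the datanode to delete blocks of removed files.
                self.on_disk[node].discard(block)

    def is_corrupt(self, path):
        return any(not self.live_replicas[b] for b in self.files[path])

    def destructive_move(self, path):
        """Salvage reachable blocks into lost+found, then drop the original."""
        blocks = self.files.pop(path)
        for index, block in enumerate(blocks):
            if self.live_replicas[block]:
                self.files[f"/lost+found{path}/{index}"] = [block]


def simulate_destructive_move():
    cluster = ToyCluster()
    cluster.add_file("/data/f", {"b0": ["dn1"], "b1": ["dn2"]})
    cluster.check_in("dn1")              # dn2 is down at startup
    assert cluster.is_corrupt("/data/f")
    cluster.destructive_move("/data/f")  # admin runs fsck move
    cluster.check_in("dn2")              # hardware fixed, dn2 rejoins
    return cluster
```

In this model, the only replica of block b1 was intact on dn2 the whole time, 
but it is deleted as soon as dn2 rejoins because no file references it any 
more.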

I think we should:
- Make fsck move non-destructive by default (e.g. it just does a move into 
lost+found)
- Make the destructive behavior optional (e.g. "--destructive", so admins think 
about what they're doing)
- Provide better sanity checks and warnings, e.g. if you're running fsck and 
not all the slaves have checked in (if using dfs.hosts), then fsck should print 
a warning to that effect, which an admin should have to override if they want 
to do something destructive
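
As a rough illustration of how these three points could fit together, here is 
a small Python sketch that only plans the actions a move would take. It is an 
assumption about one possible shape, not a proposed patch: plan_fsck_move, 
MissingDatanodesError, the override_warning parameter, and the action names are 
all hypothetical.

```python
# Hypothetical sketch of the proposal; names and flags are illustrative only.

class MissingDatanodesError(Exception):
    """Raised when a destructive move is attempted with datanodes missing."""


def plan_fsck_move(path, reachable_blocks, expected_nodes, checked_in_nodes,
                   destructive=False, override_warning=False):
    """Return the actions a move would take, without performing them."""
    # Sanity check: compare the expected datanodes (e.g. from dfs.hosts)
    # against the ones that have actually checked in.
    missing = sorted(set(expected_nodes) - set(checked_in_nodes))
    actions = []
    if missing:
        actions.append(("warn", "datanodes not checked in: " + ", ".join(missing)))
        if destructive and not override_warning:
            raise MissingDatanodesError(
                "refusing destructive move while datanodes are missing: "
                + ", ".join(missing))
    if destructive:
        # Opt-in behavior: salvage reachable blocks, then remove the original.
        for index in reachable_blocks:
            actions.append(("copy_block", f"/lost+found{path}/{index}"))
        actions.append(("delete", path))
    else:
        # Default: just move the file, so its blocks stay referenced.
        actions.append(("rename", path, f"/lost+found{path}"))
    return actions
```

With the default, a file whose datanodes are temporarily down is only renamed, 
so its blocks are still referenced when those datanodes come back.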
