[ https://issues.apache.org/jira/browse/HDFS-1260?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Todd Lipcon resolved HDFS-1260.
-------------------------------

       Resolution: Fixed
    Fix Version/s: (was: 0.20-append) 0.20.205.0

This was committed to 0.20.205, resolving JIRA

> 0.20: Block lost when multiple DNs trying to recover it to different genstamps
> ------------------------------------------------------------------------------
>
>                 Key: HDFS-1260
>                 URL: https://issues.apache.org/jira/browse/HDFS-1260
>             Project: Hadoop HDFS
>          Issue Type: Bug
>    Affects Versions: 0.20-append
>            Reporter: Todd Lipcon
>            Assignee: Todd Lipcon
>            Priority: Critical
>             Fix For: 0.20.205.0
>
>         Attachments: HDFS-1260-20S.3.patch, hdfs-1260.txt, hdfs-1260.txt,
> simultaneous-recoveries.txt
>
>
> Saw this issue on a cluster where some ops people were doing network changes
> without shutting down DNs first. So, recovery ended up getting started at
> multiple different DNs at the same time, and some race condition occurred
> that caused a block to get permanently stuck in recovery mode. What seems to
> have happened is the following:
> - FSDataset.tryUpdateBlock called with old genstamp 7091, new genstamp 7094,
> while the block in the volumeMap (and on filesystem) was genstamp 7093
> - we find the block file and meta file based on block ID only, without
> comparing gen stamp
> - we rename the meta file to the new genstamp _7094
> - in updateBlockMap, we do comparison in the volumeMap by oldblock *without*
> wildcard GS, so it does *not* update volumeMap
> - validateBlockMetaData now fails with "blk_7739687463244048122_7094 does not
> exist in blocks map"
> After this point, all future recovery attempts to that node fail in
> getBlockMetaDataInfo, since it finds the _7094 gen stamp in getStoredBlock
> (since the meta file got renamed above) and then fails since _7094 isn't in
> volumeMap in validateBlockMetadata
> Making a unit test for this is probably going to be difficult, but doable.

--
This message is automatically generated by JIRA.
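The sequence in the issue description can be modelled in a few lines. What follows is a minimal sketch and not the real FSDataset code: the class `GenstampRaceSketch`, its two maps, and the methods `addBlock`, `tryUpdateBlockBuggy`, `tryUpdateBlockGuarded` and `isConsistent` are all hypothetical names invented for illustration. It assumes the on-disk meta file can be represented as a map from block ID to genstamp, and the volumeMap as a map keyed by the exact `blk_<id>_<genstamp>` string. The guarded variant is only one plausible way to avoid the mismatch and is not claimed to be what the committed patch does.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical, simplified model of the race described above.
// None of these names come from the Hadoop source tree.
class GenstampRaceSketch {
    // Stand-in for volumeMap: keyed by exact block name, genstamp included.
    final Map<String, String> volumeMap = new HashMap<>();
    // Stand-in for the meta file on disk: block ID -> genstamp in the file name.
    final Map<Long, Long> metaFileGenStamp = new HashMap<>();

    static String key(long id, long genStamp) {
        return "blk_" + id + "_" + genStamp;
    }

    void addBlock(long id, long genStamp) {
        volumeMap.put(key(id, genStamp), "replica");
        metaFileGenStamp.put(id, genStamp);
    }

    // Mirrors the reported sequence: the file is found by block ID only and
    // renamed first; the map is then looked up with the exact old genstamp.
    boolean tryUpdateBlockBuggy(long id, long oldGs, long newGs) {
        if (!metaFileGenStamp.containsKey(id)) {
            return false;
        }
        metaFileGenStamp.put(id, newGs);          // "rename" the meta file
        String replica = volumeMap.remove(key(id, oldGs));
        if (replica == null) {
            return false;                         // map untouched, file already renamed
        }
        volumeMap.put(key(id, newGs), replica);
        return true;
    }

    // One possible guard (an assumption, not the committed fix): refuse the
    // update unless the stored genstamp matches the caller's old genstamp.
    boolean tryUpdateBlockGuarded(long id, long oldGs, long newGs) {
        Long onDisk = metaFileGenStamp.get(id);
        if (onDisk == null || onDisk != oldGs) {
            return false;                         // nothing renamed, nothing changed
        }
        metaFileGenStamp.put(id, newGs);
        volumeMap.put(key(id, newGs), volumeMap.remove(key(id, oldGs)));
        return true;
    }

    // True when the genstamp on disk still has a matching volumeMap entry.
    boolean isConsistent(long id) {
        Long gs = metaFileGenStamp.get(id);
        return gs != null && volumeMap.containsKey(key(id, gs));
    }
}
```

Calling `tryUpdateBlockBuggy(id, 7091, 7094)` on a block stored at genstamp 7093 leaves the model's "disk" at 7094 while the map still holds 7093, so `isConsistent` returns false, which corresponds to the "does not exist in blocks map" failure. The guarded variant rejects the same call without touching either structure.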