[ https://issues.apache.org/jira/browse/HDFS-4811?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Chris Nauroth resolved HDFS-4811.
---------------------------------

    Resolution: Duplicate

I've reassigned HDFS-3519 to myself, and I'm resolving HDFS-4811 as duplicate. Thanks, Todd and Andrew.

> race condition between 2 namenodes in standby that are trying to checkpoint
> with one another can delete or corrupt a good fsimage
> ---------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HDFS-4811
>                 URL: https://issues.apache.org/jira/browse/HDFS-4811
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: ha
>    Affects Versions: 3.0.0, 2.0.5-beta
>            Reporter: Chris Nauroth
>
> The problem occurs under concurrent execution of the namenode running its own
> checkpoint in {{StandbyCheckpointer}} in thread 1 while also getting a
> checkpoint from a different namenode in {{GetImageServlet}} in thread 2. It
> is possible for thread 2 to finish writing the checkpoint to the directory,
> but then get suspended before it has a chance to rename it to its final
> destination as an fsimage file. Then, thread 1 wakes up and starts writing
> its own data to the checkpoint file. When thread 2 resumes, it then tries to
> rename the file that thread 1 still holds open for writing. Depending on OS,
> this either moves thread 1's incomplete checkpoint to fsimage, or it just
> outright deletes the existing good fsimage until thread 1 finishes writing
> and renames.
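
The interleaving described above depends on both threads writing to the same in-progress checkpoint file. As an illustration only, the following Java sketch shows one general way to avoid that kind of race: each writer gets its own uniquely named temporary file in the target directory and publishes it with a single rename. The class `CheckpointSketch` and the method `saveCheckpoint` are hypothetical names invented for this example; this is not Hadoop code and is not necessarily how HDFS-3519 addresses the problem.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.Arrays;

// Hypothetical sketch, not Hadoop code.
class CheckpointSketch {

    // Writes 'image' to a temp file that only this caller uses, then renames
    // it to 'finalName'. No other writer can truncate the file being renamed.
    static Path saveCheckpoint(Path dir, String finalName, byte[] image) throws IOException {
        // Unique name per call, created in the same directory as the target
        // so the rename stays on one file system.
        Path tmp = Files.createTempFile(dir, finalName + ".ckpt_", ".tmp");
        try {
            Files.write(tmp, image);
            Path target = dir.resolve(finalName);
            // With ATOMIC_MOVE, whether an existing target is replaced is
            // implementation specific; on POSIX systems it maps to rename(2),
            // which replaces the target atomically.
            return Files.move(tmp, target, StandardCopyOption.ATOMIC_MOVE);
        } finally {
            // Cleans up only if the move did not happen.
            Files.deleteIfExists(tmp);
        }
    }

    // Stand-in for one of the two checkpointing threads in the report.
    private static Thread writer(Path dir, byte[] image) {
        return new Thread(() -> {
            try {
                saveCheckpoint(dir, "fsimage", image);
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        });
    }

    public static void main(String[] args) throws Exception {
        Path dir = Files.createTempDirectory("ckpt-sketch");
        byte[] local = "image-from-local-checkpointer".getBytes(StandardCharsets.UTF_8);
        byte[] remote = "image-from-remote-upload".getBytes(StandardCharsets.UTF_8);

        Thread t1 = writer(dir, local);
        Thread t2 = writer(dir, remote);
        t1.start();
        t2.start();
        t1.join();
        t2.join();

        // Whichever rename happened last wins, but the result is a whole image.
        byte[] result = Files.readAllBytes(dir.resolve("fsimage"));
        boolean complete = Arrays.equals(result, local) || Arrays.equals(result, remote);
        System.out.println("fsimage is a complete image: " + complete);
    }
}
```

The design choice here is isolation rather than locking: because neither writer ever opens the other's temporary file, a suspended thread can only rename a file it finished writing itself. Serializing the two checkpoint paths with a lock would be an alternative with different trade-offs.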