Adam Binford created HDFS-17710:
-----------------------------------

             Summary: Standby node can load unpersisted edit from JournalNode cache
                 Key: HDFS-17710
                 URL: https://issues.apache.org/jira/browse/HDFS-17710
             Project: Hadoop HDFS
          Issue Type: Bug
          Components: journal-node
    Affects Versions: 3.4.1
            Reporter: Adam Binford
A standby or observer NameNode can load edits from a JournalNode that were never durably persisted. This can cause the standby or observer to incorrectly believe that the last committed transaction ID is higher than it actually is.

This is the scenario that led us to find this. We have three NameNodes: NN1, NN2, and NN3. NN1 was active, NN2 standby, and NN3 observer. NN2 was failing to upload fsimage checkpoints to the other NameNodes, for reasons we are still investigating. Because a checkpoint was never fully created, the JournalNodes could never clean up old edit files. This caused all three of our JournalNodes to slowly fill up and eventually run out of disk space; since the JournalNodes store effectively the same data, they all filled up at nearly the same time.

Once the JournalNodes could no longer write new transactions, NN1 and NN2 both entered restart loops: as soon as one finished booting and exited safe mode, the ZKFC made it active, and it crashed after failing to persist new transactions. NN3 stayed up in observer mode the whole time and never crashed, since it never tried to write new transactions. Because the JournalNodes run on VMs, we simply increased their disk size to get them functioning again. At that point NN1 and NN2 were still in the process of booting up, so we put NN3 into standby mode so that the ZKFC could make it active right away, getting our system back online. After this, NN1 and NN2 failed to boot up due to a missing edits file on the JournalNodes.

We believe this all stems from the fact that transactions are added to the edit cache on the JournalNodes [before they are persisted to disk|https://github.com/apache/hadoop/blob/f38d7072566e88c77e47d1533e4be4c1bd98a06a/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/qjournal/server/Journal.java#L433] (see the sketch at the end of this description). We think what happened is something like:
* Before the disks filled up, NN1 successfully committed transaction 0096 to the JournalNodes.
* NN1 attempted to write transactions 0097 and 0098 to the JournalNodes. These transactions were added to the edit cache, but then failed to persist to disk because the disks were full. The write failed on NN1, which crashed and restarted. NN2 then became active and entered the same crash-and-restart loop.
* NN3 was tailing the edits, and the JournalNodes all returned transactions 0097 and 0098 from the edit cache. Because of this, NN3 believed that everything up through transaction 0098 had been durably persisted.
* Disk sizes were increased and the JournalNodes could write transactions again.
* NN3 became active, assumed everything up through transaction 0098 had been committed, and began writing new transactions starting at 0099; the JournalNodes updated their committed transaction ID up to 0099.
* No JournalNode actually had transactions 0097 and 0098 written to disk, so when NN1 and NN2 started up, they failed to load edits from the JournalNodes: the journals claimed to have everything through transaction 0099, but no file could be found containing those edits.

I had to manually delete all edits files associated with any transaction >= 0099, and manually edit the committed-txn file back to 0096, to finally get all the NameNodes to boot back up to a consistent state.
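To make the failure mode concrete, below is a minimal, self-contained toy model of the ordering described above. This is *not* Hadoop code: the {{cache}}/{{disk}} lists, the {{journal()}} method, and the class name are stand-ins for the JournalNode's in-memory edit cache and its on-disk segment file, and the transaction IDs mirror the scenario in the list.

{code:java}
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

/**
 * Toy model (not Hadoop code) of the ordering described above: edits enter
 * the JournalNode's in-memory cache before they are durably persisted, so a
 * failed persist leaves cached edits that tailing NameNodes can still read.
 */
public class CacheBeforePersistDemo {

  /** Stand-in for the in-memory edit cache that serves RPC tailers. */
  static final List<Long> cache = new ArrayList<>();

  /** Stand-in for edits durably written to the segment file on disk. */
  static final List<Long> disk = new ArrayList<>();

  static boolean diskFull = false;

  /** Mirrors the problematic order: cache first, then persist. */
  static void journal(long txId) throws IOException {
    cache.add(txId);           // (1) already visible to tailers
    if (diskFull) {
      throw new IOException("No space left on device"); // flush fails...
    }
    disk.add(txId);            // (2) ...so this is never reached
  }

  public static void main(String[] args) {
    for (long tx = 94; tx <= 96; tx++) {
      try { journal(tx); } catch (IOException ignored) { }
    }
    diskFull = true;           // the JournalNode disks fill up
    for (long tx = 97; tx <= 98; tx++) {
      try { journal(tx); } catch (IOException ignored) { }
    }
    // An observer tailing via RPC sees 0097 and 0098; the disk does not.
    System.out.println("cache (served to tailers): " + cache); // ..., 97, 98
    System.out.println("disk  (durably persisted): " + disk);  // ..., 96
  }
}
{code}

In the real {{Journal#journal}}, the cache is populated before the segment is flushed (the line linked above), and the cached entries are not rolled back when the flush throws. Either persisting before caching, or evicting the cached transaction range when the flush fails, would presumably keep tailing NameNodes from observing transactions that were never made durable.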