Adam Binford created HDFS-17710:
-----------------------------------

             Summary: Standby node can load unpersisted edit from JournalNode cache
                 Key: HDFS-17710
                 URL: https://issues.apache.org/jira/browse/HDFS-17710
             Project: Hadoop HDFS
          Issue Type: Bug
          Components: journal-node
    Affects Versions: 3.4.1
            Reporter: Adam Binford


A standby or observer NameNode can load edits from the JournalNodes that were 
never durably persisted. This can cause the standby or observer to incorrectly 
conclude that the last committed transaction ID is higher than it actually is. 
This is the scenario that led us to find this:

We have three NameNodes: NN1, NN2, and NN3. NN1 is active, NN2 is standby, and 
NN3 is observer. NN2 was failing to upload fsimage checkpoints to the other 
NameNodes, for reasons we are still investigating. Because a checkpoint could 
never be fully created, the JournalNodes could never clean up old edit files. 
This caused all three of our JournalNodes to slowly fill up and eventually run 
out of disk space. Because the JournalNodes all store effectively the same 
data, they filled up at nearly the same time.

Since the JournalNodes could no longer write new transactions, NN1 and NN2 both 
entered restart loops: as soon as one of them finished booting, exited safe 
mode, and was made active by the ZKFC, it crashed because it could not persist 
new transactions. NN3 stayed up in observer mode the whole time, never crashing 
because it never tried to write new transactions.

Because the JournalNodes are just on VMs, we simply increased their disk size 
to get them functioning again. After this, NN1 and NN2 were still in the 
process of booting up, so we put NN3 into standby mode so that the ZKFC could 
make it active right away, getting our system back online. After that, NN1 and 
NN2 failed to boot up due to a missing edits file on the JournalNodes.

We believe this all stems from the fact that transactions are added to the edit 
cache on the JournalNodes [before they are persisted to 
disk|https://github.com/apache/hadoop/blob/f38d7072566e88c77e47d1533e4be4c1bd98a06a/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/qjournal/server/Journal.java#L433].
 We think what happened is something like:
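As a sketch of that ordering, here is a simplified model (not the real Journal.java; the class, field, and method names below are illustrative only) showing why caching before the durable write lets readers observe edits that exist nowhere on disk:

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of the JournalNode write path (NOT the real Journal.java).
// The point: edits enter the in-memory cache BEFORE the durable write,
// so a failed disk write leaves the cache serving unpersisted edits.
class JournalModel {
    final List<Long> editCache = new ArrayList<>();      // what tailing NNs read
    final List<Long> persistedEdits = new ArrayList<>(); // what is durable on disk
    boolean diskFull = false;

    void journal(long txid) {
        editCache.add(txid);                 // step 1: cache first (the flaw)
        if (diskFull) {
            throw new IllegalStateException("No space left on device");
        }
        persistedEdits.add(txid);            // step 2: durable write (may never run)
    }
}

public class CacheBeforePersist {
    public static void main(String[] args) {
        JournalModel jn = new JournalModel();
        jn.journal(96L);                     // committed normally
        jn.diskFull = true;
        try {
            jn.journal(97L);                 // fails to persist...
        } catch (IllegalStateException e) {
            // the active NN sees the failure and aborts
        }
        // ...but a tailing observer can still be served txid 97 from the cache.
        System.out.println("cached=" + jn.editCache + " durable=" + jn.persistedEdits);
    }
}
```

Swapping the two steps (or only publishing to the cache after a successful sync) would keep tailing readers from ever seeing an edit that is not on disk.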

* Before the disks filled up, NN1 successfully committed transaction 0096 to 
the JournalNodes.
* NN1 attempted to write transactions 0097 and 0098 to the JournalNodes. These 
transactions were added to the edit cache but failed to persist to disk because 
the disk was full. The write failed on NN1, which crashed and restarted. NN2 
then became active and entered the same crash-and-restart loop.
* NN3 was tailing the edits, and the JournalNodes all returned transactions 
0097 and 0098 from the edit cache. Because of this, NN3 believed everything up 
through transaction 0098 had been durably persisted.
* Disk sizes were increased and the JournalNodes were able to write 
transactions again.
* NN3 became active, thought that everything up through transaction 0098 had 
been committed, and began writing new transactions starting at 0099; the 
JournalNodes updated their committed transaction ID to 0099.
* No JournalNode actually had transactions 0097 and 0098 written to disk, so 
when NN1 and NN2 started up, they failed to load edits from the JournalNodes: 
the journals claimed everything up through transaction 0099 should exist, but 
no file containing those edits could be found.
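The divergence in the steps above can be replayed as a toy simulation (the txid values and the committed-txid bookkeeping are simplified stand-ins, not HDFS internals):

```java
import java.util.List;
import java.util.TreeSet;

// Replays the failure sequence as a toy simulation: cached-but-unpersisted
// edits let the observer's view run ahead of what is durable, leaving a gap
// on disk once it becomes active and writes past them.
public class TxidDivergence {
    public static void main(String[] args) {
        TreeSet<Long> onDisk = new TreeSet<>(List.of(95L, 96L)); // durably persisted
        TreeSet<Long> cache  = new TreeSet<>(onDisk);            // served to tailers

        // Disks fill: 0097 and 0098 reach the cache but never the disk.
        cache.add(97L);
        cache.add(98L);

        // NN3 tails from the cache and believes 0098 is durable.
        long nn3LastSeen = cache.last();                         // 98

        // Disks are grown; NN3 becomes active and writes from 0099 onward.
        long committedTxid = nn3LastSeen + 1;                    // 99
        onDisk.add(committedTxid);
        cache.add(committedTxid);

        // NN1 restarts and must replay every txid up to committedTxid,
        // but 0097 and 0098 exist in no file on any JournalNode.
        boolean gap = false;
        for (long t = 97L; t <= committedTxid; t++) {
            if (!onDisk.contains(t)) gap = true;
        }
        System.out.println("committedTxid=" + committedTxid + " gap=" + gap);
    }
}
```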

I had to manually delete all edits files associated with any transaction >= 
0099, and manually edit the committed-txn file back to 0096, to finally get 
all the NameNodes back up in a consistent state.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
