rstest created HDFS-17932:
-----------------------------
Summary: SecondaryNameNode checkpoint stuck in retry infinitely
after rolling upgrade
Key: HDFS-17932
URL: https://issues.apache.org/jira/browse/HDFS-17932
Project: Hadoop HDFS
Issue Type: Bug
Components: namenode, rolling upgrades
Affects Versions: 3.4.2, 3.3.6, 2.10.2
Reporter: rstest
# Summary
SecondaryNameNode checkpoint retries can fail indefinitely during a non-HA
rolling upgrade once `OP_ROLLING_UPGRADE_START` is replayed a second time.
# Bug Symptom
During a non-HA HDFS rolling upgrade from Hadoop 2.10.2 to Hadoop 3.3.6, the
SecondaryNameNode can become stuck failing checkpoint retries after a transient
NameNode RPC failure.
The failure occurs when the SecondaryNameNode replays
`OP_ROLLING_UPGRADE_START` while merging a checkpoint and the merge then fails
at the later `namenode.isRollingUpgrade()` RPC call. On retry, the same
SecondaryNameNode process reloads the checkpoint inputs and replays
`OP_ROLLING_UPGRADE_START` again, but its local in-memory `FSNamesystem` still
has `rollingUpgradeInfo` active from the failed first attempt.
The retry then fails with a `RollingUpgradeException`, because
`FSNamesystem.checkRollingUpgrade("start rolling upgrade")` rejects starting a
rolling upgrade while one is already in progress.
Expected behavior:
- A transient NameNode RPC failure during SecondaryNameNode checkpoint merge
should be recoverable.
- A checkpoint retry should not fail because stale in-memory rolling-upgrade
state from the failed merge attempt remains active.
Actual behavior:
- The running SecondaryNameNode checkpoint loop can fail on every subsequent
retry.
- Checkpoints stop being produced/uploaded by the SecondaryNameNode.
- Edit logs can continue accumulating until the SecondaryNameNode is restarted
or local checkpoint state is manually cleaned up.
Relevant code path:
- `SecondaryNameNode.doCheckpoint()` calls `doMerge(...)`.
- `SecondaryNameNode.doMerge()` calls
`Checkpointer.rollForwardByApplyingLogs(...)`.
- Edit replay sees `OP_ROLLING_UPGRADE_START`.
- `FSEditLogLoader` calls `fsNamesys.startRollingUpgradeInternal(startTime)`.
- `startRollingUpgradeInternal` sets `rollingUpgradeInfo` in the
SecondaryNameNode-local `FSNamesystem`.
- `doMerge()` saves the merged local fsimage, then calls
`namenode.isRollingUpgrade()`.
- If that RPC fails, `doCheckpoint()` calls `checkpointImage.setMergeError()`.
- Retry reloads/replays image+edits, but the local `rollingUpgradeInfo` state
is still active.
- Replaying `OP_ROLLING_UPGRADE_START` again throws because rolling upgrade is
already in progress.
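The sequence above can be illustrated with a small standalone model. This is only a sketch: `LocalNamesystem`, `NameNodeRpc`, `RollingUpgradeInProgress`, and `doMerge` here are hypothetical stand-ins written for this illustration, not the Hadoop classes or their real signatures. It assumes, as described above, that the START replay checks and sets local state before the RPC is issued, and that nothing clears that state between attempts.

```java
// Toy model only: hypothetical stand-ins, not Hadoop classes or signatures.
class StaleRollingUpgradeModel {

    /** Stand-in for RollingUpgradeException. */
    static class RollingUpgradeInProgress extends RuntimeException {
        RollingUpgradeInProgress(String msg) { super(msg); }
    }

    /** Stand-in for the SecondaryNameNode-local FSNamesystem. */
    static class LocalNamesystem {
        private Long rollingUpgradeStartTime; // null means no rolling upgrade is active

        // Check-then-set, mirroring the behavior this report describes.
        void startRollingUpgradeInternal(long startTime) {
            if (rollingUpgradeStartTime != null) {
                throw new RollingUpgradeInProgress(
                    "Failed to start rolling upgrade since a rolling upgrade is already in progress.");
            }
            rollingUpgradeStartTime = startTime;
        }
    }

    /** Stand-in for the NameNode RPC proxy. */
    interface NameNodeRpc {
        boolean isRollingUpgrade() throws java.io.IOException;
    }

    /** One merge attempt: replay the START op, then query the NameNode. */
    static void doMerge(LocalNamesystem ns, NameNodeRpc nn, long startTime)
            throws java.io.IOException {
        ns.startRollingUpgradeInternal(startTime); // replay of OP_ROLLING_UPGRADE_START
        nn.isRollingUpgrade();                      // RPC issued after the local merge
    }

    /** Runs two attempts against the same namesystem and reports each outcome. */
    static String[] runTwoAttempts() {
        LocalNamesystem ns = new LocalNamesystem();
        String[] outcomes = new String[2];
        // Attempt 1: the NameNode is down, so the RPC fails after replay.
        try {
            doMerge(ns, () -> { throw new java.io.EOFException("connection closed"); }, 1000L);
            outcomes[0] = "ok";
        } catch (java.io.IOException e) {
            outcomes[0] = "rpc-failure";
        }
        // Attempt 2: the NameNode is back, but the local state was never cleared.
        try {
            doMerge(ns, () -> true, 1000L);
            outcomes[1] = "ok";
        } catch (RollingUpgradeInProgress e) {
            outcomes[1] = "already-in-progress";
        } catch (java.io.IOException e) {
            outcomes[1] = "rpc-failure";
        }
        return outcomes;
    }

    public static void main(String[] args) {
        String[] o = runTwoAttempts();
        System.out.println(o[0] + " -> " + o[1]); // prints "rpc-failure -> already-in-progress"
    }
}
```

In this model the second attempt fails even though the NameNode is reachable again, because the only state that changed between attempts is the stale local start time.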
Version pairs tested:
- Hadoop 2.10.2 -> Hadoop 3.3.6: issue observed.
- Hadoop 3.3.6 -> Hadoop 3.4.2: also covered by upgrade testing, but this
specific failure was observed on 2.10.2 -> 3.3.6.
# How To Reproduce
One way to reproduce is to force a transient NameNode RPC failure during the
narrow checkpoint window after the SecondaryNameNode has replayed
`OP_ROLLING_UPGRADE_START` but before `SecondaryNameNode.doMerge()` completes.
1. Start a non-HA HDFS cluster on Hadoop 2.10.2 with:
   - one NameNode
   - one SecondaryNameNode
   - one DataNode
2. Prepare a rolling upgrade on the NameNode.
This should create a rollback image and write a rolling-upgrade START operation
into the edit log.
3. Upgrade/start the SecondaryNameNode on Hadoop 3.3.6 while the rolling
upgrade is prepared and not finalized.
4. Trigger or wait for a SecondaryNameNode checkpoint.
The SecondaryNameNode should download the rollback fsimage and edit logs, then
replay edits containing `OP_ROLLING_UPGRADE_START`.
5. After the SecondaryNameNode has replayed `OP_ROLLING_UPGRADE_START` and
saved the merged local checkpoint image, but before `doMerge()` completes, make
the NameNode RPC endpoint temporarily unavailable.
A practical way to induce this is to restart the NameNode at this point. The
observed failure point is the SecondaryNameNode RPC call to
`namenode.isRollingUpgrade()`, which can fail with an EOF/connection-closed
error if the NameNode is down.
6. Bring the NameNode back and let the same SecondaryNameNode process retry
checkpointing.
7. Observe that the retry reloads/replays the checkpoint inputs and encounters
`OP_ROLLING_UPGRADE_START` again.
8. The retry fails because the SecondaryNameNode-local `rollingUpgradeInfo`
from the failed previous merge attempt is still active.
Expected result:
- The SecondaryNameNode retry should recover from the transient RPC failure and
complete checkpointing.
Actual result:
- The retry fails with a `RollingUpgradeException` stating that a rolling
upgrade is already in progress.
- The same SecondaryNameNode process can continue failing future checkpoint
attempts until it is restarted or its local state is cleaned.
Representative exception:
```text
org.apache.hadoop.hdfs.protocol.RollingUpgradeException:
Failed to start rolling upgrade since a rolling upgrade is already in progress.
```
Potential fix direction:
- Ensure that checkpoint retry after `setMergeError()` fully resets or reloads
all SecondaryNameNode-local `FSNamesystem` state affected by edit replay,
including `rollingUpgradeInfo`.
- Alternatively, make replay of `OP_ROLLING_UPGRADE_START` idempotent in this
checkpoint-retry context when the existing rolling-upgrade info matches the
START operation being replayed.
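Both directions can be sketched against a toy model. This is not a patch: `LocalNamesystem`, `resetBeforeReload`, `startIdempotent`, and the other names below are hypothetical and exist only for this illustration. The sketch also assumes the START operation can be matched by its start time alone, which may not be sufficient in the real code.

```java
// Toy model only: hypothetical stand-ins, not Hadoop classes or signatures.
class CheckpointRetryFixSketch {

    /** Stand-in for RollingUpgradeException. */
    static class RollingUpgradeInProgress extends RuntimeException {
        RollingUpgradeInProgress(String msg) { super(msg); }
    }

    /** Stand-in for the SecondaryNameNode-local FSNamesystem. */
    static class LocalNamesystem {
        private Long rollingUpgradeStartTime; // null means no rolling upgrade is active

        /** Behavior described in this report: reject any second START. */
        void startStrict(long startTime) {
            if (rollingUpgradeStartTime != null) {
                throw new RollingUpgradeInProgress("rolling upgrade is already in progress");
            }
            rollingUpgradeStartTime = startTime;
        }

        /** Direction 1: drop replay-derived state before the retry reloads image and edits. */
        void resetBeforeReload() {
            rollingUpgradeStartTime = null;
        }

        /** Direction 2: accept a replayed START only if it matches the active upgrade. */
        void startIdempotent(long startTime) {
            if (rollingUpgradeStartTime != null) {
                if (rollingUpgradeStartTime.longValue() == startTime) {
                    return; // same START replayed again: nothing to do
                }
                throw new RollingUpgradeInProgress(
                    "a different rolling upgrade is already in progress");
            }
            rollingUpgradeStartTime = startTime;
        }

        boolean isRollingUpgrade() { return rollingUpgradeStartTime != null; }
    }

    /** Direction 1: the retry path clears stale state, so the strict replay succeeds. */
    static boolean retryWithReset(long startTime) {
        LocalNamesystem ns = new LocalNamesystem();
        ns.startStrict(startTime);  // first attempt replays START, then fails later
        ns.resetBeforeReload();     // retry clears the stale state
        ns.startStrict(startTime);  // second replay no longer throws
        return ns.isRollingUpgrade();
    }

    /** Direction 2: the second replay is tolerated only when it matches the first. */
    static boolean retryIdempotent(long firstStart, long replayedStart) {
        LocalNamesystem ns = new LocalNamesystem();
        ns.startIdempotent(firstStart);
        ns.startIdempotent(replayedStart); // throws if the replayed START differs
        return ns.isRollingUpgrade();
    }

    public static void main(String[] args) {
        System.out.println(retryWithReset(1000L));         // prints "true"
        System.out.println(retryIdempotent(1000L, 1000L)); // prints "true"
    }
}
```

The reset approach keeps the strict check untouched but depends on every piece of replay-derived state being cleared on retry. The idempotent approach is narrower, and in this sketch it still rejects a START whose start time differs from the active one.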