[
https://issues.apache.org/jira/browse/HDDS-10798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17886264#comment-17886264
]
Tsz-wo Sze edited comment on HDDS-10798 at 10/2/24 6:46 PM:
------------------------------------------------------------
For a cluster that has HDDS-10546 but not HDDS-10798, the OM leader can remain
not ready indefinitely. Below are some details:
- Step 0: OM startup
{code:java}
ratisApplied: 9 // the applied index in Ratis
ozoneApplied: 9 // the applied index in Ozone
lastSkipped : -1 // the last index skipped in notifyTermIndexUpdated(..)
lastNotified: -1 // the last index passed to notifyTermIndexUpdated(..)
{code}
-* Note 1: ratisApplied and ozoneApplied can differ because of the OM double
buffer. OzoneManagerStateMachine.applyTransaction(..) stores the
transaction in the double buffer (instead of applying it) and then returns
complete to Ratis. As a result, ratisApplied is incremented but
ozoneApplied is not until the double buffer is flushed.
-* Note 2: Only the indices of non-state machine log entries are passed to
notifyTermIndexUpdated(..). Suppose the indices passed to
notifyTermIndexUpdated(..) are
{code}204, (), 206, (), (), 209, 210. {code}
where () marks an index that is not passed (205, 207 and 208).
Then, lastSkipped will be 208 and lastNotified will be 210.
- Step 1: applyTransaction(logEntryIndex=10)
{code:java}
double buffer: 10
ozoneApplied : 9 [unchanged: double buffer is not yet flushed]
ratisApplied : 9 -> 10
{code}
- Step 2: OM becomes the Leader and writes STARTUP_ENTRY with index=11
{code:java}
Ratis apply: CONF_ENTRY with index=11
ratisApplied : 10 -> 11
notifyTermIndexUpdated(newIndex=11)
lastSkipped : -1 -> 10
lastNotified : -1 -> 11
updateLastAppliedTermIndex? no since lastNotified(11) - ozoneApplied(9) != 1
{code}
- Step 3: (optional)
{code:java}
Ratis apply: META_ENTRY with index=12
ratisApplied : 11 -> 12
notifyTermIndexUpdated(newIndex=12)
lastSkipped : 10 [unchanged: newIndex(12) - lastNotified(11) = 1]
lastNotified : 11 -> 12
updateLastAppliedTermIndex? no since lastNotified(12) - ozoneApplied(9) != 1
{code}
- Step 4: BUG
{code:java}
Double buffer flush: 10
updateLastAppliedIndex(newTermIndex=10)
C1: newTermIndex(10) < lastNotified(12) is true
C2: ozoneApplied(9) >= lastSkipped(10) is false
newTermIndex: 10 (unchanged)
ozoneApplied : 9 -> 10
The Leader remains not ready since ozoneApplied(10) < STARTUP_ENTRY(11)
{code}
- Step 4': FIX, in C2, use newTermIndex instead of ozoneApplied
{code:java}
Double buffer flush: 10
updateLastAppliedIndex(newTermIndex=10)
C1: newTermIndex(10) < lastNotified(12) is true
C2: newTermIndex(10) >= lastSkipped(10) is true
newTermIndex: 10 -> 12
ozoneApplied : 9 -> 12
The Leader becomes ready since ozoneApplied(12) >= STARTUP_ENTRY(11)
{code}
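The steps above can be replayed with a small model. This is only a sketch under assumptions: it tracks indices only (terms are omitted), the class AppliedIndexModel and its useFix flag are hypothetical names made up for illustration, and the conditions are taken from the trace in this comment rather than copied from OzoneManagerStateMachine.

```java
/** Simplified, index-only model of the trace above. Hypothetical class, not Ozone code. */
final class AppliedIndexModel {
  private final boolean useFix; // false: C2 as in Step 4; true: C2 as in Step 4'
  private long ozoneApplied;
  private long lastSkipped = -1;
  private long lastNotified = -1;

  AppliedIndexModel(long ozoneApplied, boolean useFix) {
    this.ozoneApplied = ozoneApplied;
    this.useFix = useFix;
  }

  /** Models notifyTermIndexUpdated(..) for a non-state machine log entry. */
  void notifyTermIndexUpdated(long newIndex) {
    if (newIndex - lastNotified > 1) {
      lastSkipped = newIndex - 1; // gap: the indices in between were not notified
    }
    lastNotified = newIndex;
    if (newIndex - ozoneApplied == 1) {
      ozoneApplied = newIndex; // nothing is pending in the double buffer
    }
  }

  /** Models the applied index update triggered by a double buffer flush. */
  void updateLastAppliedIndex(long newIndex) {
    final boolean c1 = newIndex < lastNotified;
    final boolean c2 = (useFix ? newIndex : ozoneApplied) >= lastSkipped;
    if (c1 && c2) {
      newIndex = lastNotified; // jump over the already notified entries
    }
    ozoneApplied = newIndex;
  }

  boolean isLeaderReady(long startupEntryIndex) {
    return ozoneApplied >= startupEntryIndex;
  }

  long getOzoneApplied() {
    return ozoneApplied;
  }

  long getLastSkipped() {
    return lastSkipped;
  }

  long getLastNotified() {
    return lastNotified;
  }

  /** Replays Steps 0 to 4 (useFix=false) or Steps 0 to 4' (useFix=true). */
  static AppliedIndexModel replay(boolean useFix) {
    final AppliedIndexModel m = new AppliedIndexModel(9, useFix); // Step 0
    // Step 1 only puts index 10 into the double buffer; ozoneApplied stays 9.
    m.notifyTermIndexUpdated(11); // Step 2
    m.notifyTermIndexUpdated(12); // Step 3
    m.updateLastAppliedIndex(10); // Step 4 or 4': double buffer flush
    return m;
  }

  public static void main(String[] args) {
    for (boolean useFix : new boolean[] {false, true}) {
      final AppliedIndexModel m = replay(useFix);
      System.out.println("useFix=" + useFix
          + " ozoneApplied=" + m.getOzoneApplied()
          + " ready=" + m.isLeaderReady(11));
    }
  }
}
```

With useFix=false the model ends at ozoneApplied=10 and the leader is not ready; with useFix=true it ends at ozoneApplied=12 and the leader is ready, matching Step 4 and Step 4'.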
> OMLeaderNotReadyException exception on switch leader
> ----------------------------------------------------
>
> Key: HDDS-10798
> URL: https://issues.apache.org/jira/browse/HDDS-10798
> Project: Apache Ozone
> Issue Type: Bug
> Components: OM HA
> Reporter: Sumit Agrawal
> Assignee: Sumit Agrawal
> Priority: Major
> Labels: pull-request-available
> Fix For: 2.0.0
>
>
> Client is receiving a LeaderNotReady exception:
> {code:java}
> 2024-05-02 13:54:07,941 DEBUG [IPC Server handler 70 on
> 9862]-org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB:
> om72 is Leader but not ready to process request yet.{code}
>
> As part of the fix for HDDS-10546, one scenario was missed:
> * notifyTermIndexUpdate sets lastSkippedIndex because a few transactions are
> still in the double buffer
> * the doubleBuffer update-index notification does not update
> lastNotifiedTermIndex because the check
> 'lastApplied.getIndex() >= lastSkippedIndex' fails, as lastApplied is a
> much older value
> This is an intermittent issue: when an election happens while there are
> transactions in the double buffer, the notified transactionId may not be
> updated. The OM recovers after a restart.