[ 
https://issues.apache.org/jira/browse/HBASE-28638?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Viraj Jasani updated HBASE-28638:
---------------------------------
    Summary: Fail-fast retry limit for specific errors to recover from remote 
procedure failure using server crash  (was: Impose retry limit for specific 
errors to recover from remote procedure failure using server crash)

> Fail-fast retry limit for specific errors to recover from remote procedure 
> failure using server crash
> -----------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-28638
>                 URL: https://issues.apache.org/jira/browse/HBASE-28638
>             Project: HBase
>          Issue Type: Sub-task
>          Components: amv2, master, Region Assignment
>    Affects Versions: 3.0.0-beta-1, 2.6.1, 2.5.10
>            Reporter: Viraj Jasani
>            Assignee: Viraj Jasani
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 2.7.0, 3.0.0-beta-2, 2.5.11, 2.6.2
>
>
> In a recent incident, some regions suffered an availability drop of more
> than 5 minutes because, before the active master could initiate SCP for the
> dead server, some region moves tried to assign regions to the already dead
> regionserver. Sometimes, due to transient issues, the active master is
> notified of the server's death only after a few minutes (5+ minutes in this
> case).
> {code:java}
> 2024-05-08 03:47:38,518 WARN  [RSProcedureDispatcher-pool-4790] 
> procedure.RSProcedureDispatcher - request to host1,61020,1713411866443 failed 
> due to org.apache.hadoop.hbase.exceptions.ConnectionClosedException: Call to 
> address=host1:61020 failed on local exception: 
> org.apache.hadoop.hbase.exceptions.ConnectionClosedException: Connection 
> closed, try=0, retrying... {code}
> As we know, retries are infinite here, so the dispatcher kept retrying.
>  
> SCP could be initiated only after the active master discovered that the
> server was dead:
> {code:java}
> 2024-05-08 03:50:01,038 DEBUG [RegionServerTracker-0] master.DeadServer - 
> Processing host1,61020,1713411866443; numProcessing=1
> 2024-05-08 03:50:01,038 INFO  [RegionServerTracker-0] 
> master.RegionServerTracker - RegionServer ephemeral node deleted, processing 
> expiration [host1,61020,1713411866443] {code}
> leading to
> {code:java}
> 2024-05-08 03:50:02,313 DEBUG [RSProcedureDispatcher-pool-4833] 
> assignment.RegionRemoteProcedureBase - pid=54800701, ppid=54800691, 
> state=RUNNABLE; OpenRegionProcedure 5cafbe54d5685acc6c4866758e67fd51, 
> server=host1,61020,1713411866443 for region state=OPENING, 
> location=host1,61020,1713411866443, table=T1, 
> region=5cafbe54d5685acc6c4866758e67fd51, targetServer 
> host1,61020,1713411866443 is dead, SCP will interrupt us, give up {code}
> This entire outage could have been avoided if we failed fast on
> connection drop errors.
>  
> *Problem Statement:*
> Master-initiated remote procedures are scheduled by RSProcedureDispatcher. If
> it encounters specific errors on the first attempt (e.g.
> CallQueueTooBigException or SaslException), the remote call is guaranteed
> not to have reached the regionserver, so the call is marked as failed,
> prompting the parent procedure to select a different target regionserver
> and resume the operation.
> If the first attempt does not fail in this way, RSProcedureDispatcher
> continues with infinite retries. It can then hit a legitimate failure (e.g.
> ConnectionClosedException) that stalls the remote operation. Without
> manual intervention, this can delay the region-in-transition by several
> minutes or even hours.
>  
> *Proposed Solution:*
> The purpose of this Jira is to impose a retry limit for specific error types
> so that, once the limit is reached, the master can recover from the failing
> remote call by initiating SCP (ServerCrashProcedure) on the target server.
> The SCP will override the TRSP (TransitRegionStateProcedure) if required.
> This ensures that the target server has no regions online before we suspend
> the ongoing TRSP.
> Scheduling SCP for the target server always leaves the regionserver in a
> stopped state: either the regionserver stops automatically, or, if it is
> still able to send a region report to the master, the master rejects the
> report, which causes the regionserver to abort.
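
The retry decision described in the problem statement and proposed solution above could be sketched roughly as follows. This is an illustration only, not the actual patch: FailFastRetrySketch, Action, decide and FAIL_FAST_RETRY_LIMIT are hypothetical names, the limit of 5 is an assumed value, and the two HBase exception classes are local stand-ins so that the example compiles on its own. SaslException is the standard JDK class.

```java
import java.io.IOException;
import javax.security.sasl.SaslException;

final class FailFastRetrySketch {

  /** What the dispatcher does after a failed remote call (hypothetical names). */
  enum Action { RETRY, FAIL_REMOTE_CALL, SCHEDULE_SCP }

  // Local stand-ins for the HBase exception types named in the issue.
  static class CallQueueTooBigException extends IOException {}
  static class ConnectionClosedException extends IOException {}

  // Assumed value for illustration; the issue text does not state the limit.
  static final int FAIL_FAST_RETRY_LIMIT = 5;

  /**
   * Decides how to react to a failed remote call.
   *
   * @param error   the exception raised by the remote call
   * @param attempt zero-based attempt counter, like "try=0" in the logs
   */
  static Action decide(IOException error, int attempt) {
    // Errors that guarantee the call never reached the regionserver.
    boolean neverReachedServer =
        error instanceof CallQueueTooBigException || error instanceof SaslException;
    if (attempt == 0 && neverReachedServer) {
      // Safe to fail the call so the parent procedure picks another server.
      return Action.FAIL_REMOTE_CALL;
    }
    if (error instanceof ConnectionClosedException && attempt >= FAIL_FAST_RETRY_LIMIT) {
      // Retry limit reached for a fail-fast error: recover through SCP.
      return Action.SCHEDULE_SCP;
    }
    // Everything else keeps the existing unbounded retry behaviour.
    return Action.RETRY;
  }

  public static void main(String[] args) {
    for (int attempt = 0; attempt <= FAIL_FAST_RETRY_LIMIT; attempt++) {
      Action action = decide(new ConnectionClosedException(), attempt);
      System.out.println("try=" + attempt + " -> " + action);
    }
  }
}
```

With the assumed limit of 5, a ConnectionClosedException is retried for try=0 through try=4 and escalates to SCHEDULE_SCP at try=5, which bounds the stall to a fixed number of attempts instead of waiting for the master to learn about the dead server by other means.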



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
