[ https://issues.apache.org/jira/browse/HBASE-28638?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Viraj Jasani updated HBASE-28638:
---------------------------------
    Summary: Fail-fast retry limit for specific errors to recover from remote procedure failure using server crash  (was: Impose retry limit for specific errors to recover from remote procedure failure using server crash)

> Fail-fast retry limit for specific errors to recover from remote procedure failure using server crash
> -----------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-28638
>                 URL: https://issues.apache.org/jira/browse/HBASE-28638
>             Project: HBase
>          Issue Type: Sub-task
>          Components: amv2, master, Region Assignment
>    Affects Versions: 3.0.0-beta-1, 2.6.1, 2.5.10
>            Reporter: Viraj Jasani
>            Assignee: Viraj Jasani
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 2.7.0, 3.0.0-beta-2, 2.5.11, 2.6.2
>
>
> In a recent incident, some regions were unavailable for 5+ minutes because, before the active master could initiate SCP for the dead server, some region moves tried to assign regions to the already dead regionserver. Sometimes, due to transient issues, the active master is notified only after a few minutes (5+ minutes in this case).
> {code:java}
> 2024-05-08 03:47:38,518 WARN [RSProcedureDispatcher-pool-4790] procedure.RSProcedureDispatcher - request to host1,61020,1713411866443 failed due to org.apache.hadoop.hbase.exceptions.ConnectionClosedException: Call to address=host1:61020 failed on local exception: org.apache.hadoop.hbase.exceptions.ConnectionClosedException: Connection closed, try=0, retrying...
> {code}
> Because retries are infinite here, the dispatcher kept retrying.
>
> Eventually, SCP could be initiated only after the active master discovered that the server was dead:
> {code:java}
> 2024-05-08 03:50:01,038 DEBUG [RegionServerTracker-0] master.DeadServer - Processing host1,61020,1713411866443; numProcessing=1
> 2024-05-08 03:50:01,038 INFO [RegionServerTracker-0] master.RegionServerTracker - RegionServer ephemeral node deleted, processing expiration [host1,61020,1713411866443]
> {code}
> leading to
> {code:java}
> 2024-05-08 03:50:02,313 DEBUG [RSProcedureDispatcher-pool-4833] assignment.RegionRemoteProcedureBase - pid=54800701, ppid=54800691, state=RUNNABLE; OpenRegionProcedure 5cafbe54d5685acc6c4866758e67fd51, server=host1,61020,1713411866443 for region state=OPENING, location=host1,61020,1713411866443, table=T1, region=5cafbe54d5685acc6c4866758e67fd51, targetServer host1,61020,1713411866443 is dead, SCP will interrupt us, give up
> {code}
> This entire outage could be avoided if we fail fast on connection drop errors.
>
> *Problem Statement:*
> Master-initiated remote procedures are scheduled by RSProcedureDispatcher. If it encounters specific errors on the first attempt (e.g. CallQueueTooBigException or SaslException), it is guaranteed that the remote call has not reached the regionserver, so the remote call is marked as failed, prompting the parent procedure to select a different target regionserver to resume the operation.
> If the first attempt is successful, RSProcedureDispatcher continues with infinite retries. We can encounter valid cases (e.g. ConnectionClosedException) that halt the remote operation. Without manual intervention, this can delay the region-in-transition by several minutes or even hours.
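
The retry behaviour described above, combined with the fail-fast limit named in the summary, could be sketched roughly as follows. This is only an illustration under stated assumptions, not the RSProcedureDispatcher implementation: the class `FailFastRetrySketch`, the `Decision` enum, the two stand-in exception types, the `decide` method and the `failFastLimit` parameter are all hypothetical names. Attempt numbers are zero-based here, matching the `try=0` in the log above.

```java
import java.io.IOException;

public class FailFastRetrySketch {

  /** Hypothetical outcomes for a failed master-to-regionserver call. */
  public enum Decision { FAIL_REMOTE_CALL, RETRY, SCHEDULE_SCP }

  /** Stand-in for errors that prove the call never reached the regionserver
   *  (the description names CallQueueTooBigException and SaslException). */
  public static class NotDeliveredException extends IOException {
    public NotDeliveredException(String msg) { super(msg); }
  }

  /** Stand-in for connection-drop errors such as ConnectionClosedException. */
  public static class ConnectionDroppedException extends IOException {
    public ConnectionDroppedException(String msg) { super(msg); }
  }

  /**
   * Decide what to do after a failed attempt.
   *
   * @param error         the failure seen on this attempt
   * @param attempt       zero-based attempt number
   * @param failFastLimit assumed limit on retries for connection-drop errors
   */
  public static Decision decide(IOException error, int attempt, int failFastLimit) {
    // First attempt and the call provably never arrived: fail the remote
    // call so the parent procedure can pick a different target server.
    if (attempt == 0 && error instanceof NotDeliveredException) {
      return Decision.FAIL_REMOTE_CALL;
    }
    // Connection keeps dropping and the limit is reached: stop retrying
    // and recover through a server crash procedure on the target.
    if (error instanceof ConnectionDroppedException && attempt >= failFastLimit) {
      return Decision.SCHEDULE_SCP;
    }
    // Everything else keeps the existing behaviour: retry.
    return Decision.RETRY;
  }

  public static void main(String[] args) {
    int limit = 3; // illustrative value only
    System.out.println(decide(new NotDeliveredException("call queue too big"), 0, limit));
    System.out.println(decide(new ConnectionDroppedException("connection closed"), 1, limit));
    System.out.println(decide(new ConnectionDroppedException("connection closed"), 3, limit));
  }
}
```

The point of the sketch is that only the second branch is new: without it, a connection-drop error always falls through to `RETRY`, which is the unbounded loop seen in the incident.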
>
> *Proposed Solution:*
> The purpose of this Jira is to impose a retry limit for specific error types such that, if the retry limit is reached, the master can recover from the ongoing remote call failure by initiating SCP (ServerCrashProcedure) on the target server. The SCP will override the TRSP (TransitRegionStateProcedure) if required. This ensures that the target server has no regions online before we suspend the ongoing TRSP.
> Scheduling SCP for the target server will always lead to the regionserver being stopped: either the regionserver is stopped automatically or, if it is still able to send its region report to the master, the master rejects the report, which in turn causes the regionserver to abort.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)