[jira] [Commented] (HBASE-29251) Procedure gets stuck if the procedure state cannot be persisted

Hudson (Jira) Thu, 24 Apr 2025 04:42:15 -0700


    [ 
https://issues.apache.org/jira/browse/HBASE-29251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17947024#comment-17947024
 ]


Hudson commented on HBASE-29251:
--------------------------------

Results for branch branch-2
        [build #1262 on 
builds.a.o|https://ci-hbase.apache.org/job/HBase%20Nightly/job/branch-2/1262/]: 
(x) *{color:red}-1 overall{color}*
----
details (if available):

(/) {color:green}+1 general checks{color}
-- For more information [see general 
report|https://ci-hbase.apache.org/job/HBase%20Nightly/job/branch-2/1262/General_20Nightly_20Build_20Report/]


(x) {color:red}-1 jdk8 hadoop2 checks{color}
-- For more information [see jdk8 (hadoop2) 
report|https://ci-hbase.apache.org/job/HBase%20Nightly/job/branch-2/1262/JDK8_20Nightly_20Build_20Report_20_28Hadoop2_29/]


(x) {color:red}-1 jdk8 hadoop3 checks{color}
-- For more information [see jdk8 (hadoop3) 
report|https://ci-hbase.apache.org/job/HBase%20Nightly/job/branch-2/1262/JDK8_20Nightly_20Build_20Report_20_28Hadoop3_29/]


(/) {color:green}+1 jdk11 hadoop3 checks{color}
-- For more information [see jdk11 
report|https://ci-hbase.apache.org/job/HBase%20Nightly/job/branch-2/1262/JDK11_20Nightly_20Build_20Report_20_28Hadoop3_29/]


(x) {color:red}-1 jdk17 hadoop3 checks{color}
-- For more information [see jdk17 
report|https://ci-hbase.apache.org/job/HBase%20Nightly/job/branch-2/1262/JDK17_20Nightly_20Build_20Report_20_28Hadoop3_29/]


(/) {color:green}+1 jdk17 hadoop 3.3.5 backward compatibility checks{color}
-- For more information [see jdk17 
report|https://ci-hbase.apache.org/job/HBase%20Nightly/job/branch-2/1262/JDK17_20Nightly_20Build_20Report_20_28Hadoop3_29/]


(/) {color:green}+1 jdk17 hadoop 3.3.6 backward compatibility checks{color}
-- For more information [see jdk17 
report|https://ci-hbase.apache.org/job/HBase%20Nightly/job/branch-2/1262/JDK17_20Nightly_20Build_20Report_20_28Hadoop3_29/]


(x) {color:red}-1 jdk17 hadoop 3.4.0 backward compatibility checks{color}
-- For more information [see jdk17 
report|https://ci-hbase.apache.org/job/HBase%20Nightly/job/branch-2/1262/JDK17_20Nightly_20Build_20Report_20_28Hadoop3_29/]


(/) {color:green}+1 source release artifact{color}
-- See build output for details.


(/) {color:green}+1 client integration test for HBase 2 {color}
(/) {color:green}+1 client integration test for 3.3.5 {color}
(/) {color:green}+1 client integration test for 3.3.6 {color}
(/) {color:green}+1 client integration test for 3.4.0 {color}
(/) {color:green}+1 client integration test for 3.4.1 {color}


> Procedure gets stuck if the procedure state cannot be persisted
> ---------------------------------------------------------------
>
>                 Key: HBASE-29251
>                 URL: https://issues.apache.org/jira/browse/HBASE-29251
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 2.4.18, 3.0.0-beta-1, 2.5.11, 2.6.2
>            Reporter: Viraj Jasani
>            Assignee: Viraj Jasani
>            Priority: Critical
>              Labels: pull-request-available
>             Fix For: 2.7.0, 3.0.0-beta-2, 2.6.3, 2.5.12
>
>
> When a given regionserver stops or aborts, the active master initiates the 
> corresponding ServerCrashProcedure (SCP). We recently came across a case where 
> the initial state of the SCP, SERVER_CRASH_START, could not be persisted in 
> the local region store:
> {code:java}
> 2025-04-09 19:00:23,538 ERROR [RegionServerTracker-0] 
> region.RegionProcedureStore - Failed to update proc pid=60020, 
> state=RUNNABLE:SERVER_CRASH_START; ServerCrashProcedure 
> server1,60020,1731526432248, splitWal=true, meta=false
> java.io.InterruptedIOException: No ack received after 55s and a timeout of 55s
>     at 
> org.apache.hadoop.hdfs.DataStreamer.waitForAckedSeqno(DataStreamer.java:938)
>     at 
> org.apache.hadoop.hdfs.DFSOutputStream.flushOrSync(DFSOutputStream.java:692)
>     at org.apache.hadoop.hdfs.DFSOutputStream.hflush(DFSOutputStream.java:580)
>     at 
> org.apache.hadoop.fs.FSDataOutputStream.hflush(FSDataOutputStream.java:136)
>     at 
> org.apache.hadoop.hbase.regionserver.wal.ProtobufLogWriter.sync(ProtobufLogWriter.java:85)
>     at 
> org.apache.hadoop.hbase.regionserver.wal.FSHLog$SyncRunner.run(FSHLog.java:666)
>  {code}
>  
> As a result, no further action was taken on the SCP; it stayed stuck until 
> the active master was restarted manually.
> After the manual restart, the new active master was able to proceed with the 
> SCP:
> {code:java}
> 2025-04-09 20:43:07,693 DEBUG [master/hmaster-3:60000:becomeActiveMaster] 
> procedure2.ProcedureExecutor - Stored pid=60771, 
> state=RUNNABLE:SERVER_CRASH_START; ServerCrashProcedure 
> server1,60020,1731526432248, splitWal=true, meta=false
> 2025-04-09 20:44:15,312 INFO  [PEWorker-18] procedure2.ProcedureExecutor - 
> Finished pid=60771, state=SUCCESS; ServerCrashProcedure 
> server1,60020,1731526432248, splitWal=true, meta=false in 1 mins, 7.667 sec 
> {code}
>  
> It is well known that the file system backing the master local region must 
> be healthy for the active master to operate without functional issues. 
> However, hdfs can have issues, and the master should be able to recover 
> procedures like SCP unless the hdfs issues persist for a longer duration.
> A couple of proposals:
>  * Provide retries for the proc store persist failures
>  * Abort the active master so that a new master can continue the recovery 
> (deployment systems usually ensure that aborted servers are auto-started, 
> e.g. k8s or ambari)
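
The two proposals above can be combined: retry the persist a bounded number of times, and request an abort only if every attempt fails. The following is a minimal sketch of that idea, not the actual fix for this issue. PersistRetrySketch, PersistAction, persistWithRetries and abortHook are hypothetical names for illustration and are not HBase APIs; the flaky store in main is simulated.

```java
import java.io.IOException;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical sketch; none of these names are HBase APIs. */
public class PersistRetrySketch {

  /** Stand-in for a procedure store update that may fail with an IOException. */
  @FunctionalInterface
  public interface PersistAction {
    void persist() throws IOException;
  }

  /**
   * Tries the action up to maxAttempts times, doubling the backoff after each failure.
   * If every attempt fails (or the thread is interrupted while backing off), runs
   * abortHook, a stand-in for aborting the active master, and returns false.
   */
  public static boolean persistWithRetries(PersistAction action, int maxAttempts,
      long initialBackoffMs, Runnable abortHook) {
    long backoffMs = initialBackoffMs;
    for (int attempt = 1; attempt <= maxAttempts; attempt++) {
      try {
        action.persist();
        return true; // persisted, nothing more to do
      } catch (IOException e) {
        // InterruptedIOException, as seen in the log above, is an IOException too.
        if (attempt == maxAttempts) {
          break; // out of attempts, fall through to abort
        }
        try {
          Thread.sleep(backoffMs);
        } catch (InterruptedException ie) {
          Thread.currentThread().interrupt(); // preserve interrupt status
          break; // stop retrying and fall through to abort
        }
        backoffMs *= 2;
      }
    }
    abortHook.run();
    return false;
  }

  public static void main(String[] args) {
    AtomicInteger calls = new AtomicInteger();
    // Simulated store that fails twice, then succeeds.
    PersistAction flaky = () -> {
      if (calls.incrementAndGet() < 3) {
        throw new IOException("simulated hflush timeout");
      }
    };
    boolean ok = persistWithRetries(flaky, 5, 10L,
        () -> System.out.println("abort requested"));
    System.out.println("persisted=" + ok + " after " + calls.get() + " attempts");
  }
}
```

Bounding the attempts keeps a short hdfs hiccup from leaving the procedure stuck, while a longer outage still ends in an abort so that a restarted or standby master can take over the recovery.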



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
