The problem was actually with the Snappy codec or the native Snappy libraries. After configuring the Snappy Java implementation, the cluster started without any problems.
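For reference, such a switch can amount to something like the following. This is a sketch only: the property name and codec class come from HBase's pluggable compression support (HBASE-26259) and should be verified against the reference guide for the exact release.

  # In conf/hbase-site.xml, inside <configuration> (property name and class
  # are assumptions based on HBASE-26259; verify for your release):
  #   <property>
  #     <name>hbase.io.compress.snappy.codec</name>
  #     <value>org.apache.hadoop.hbase.io.compress.aircompressor.SnappyCodec</value>
  #   </property>
  # Then check that the codec loads before restarting the cluster:
  hbase org.apache.hadoop.hbase.util.CompressionTest file:///tmp/test.txt snappy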
I have a final question regarding the HBase distributions. Can you please tell me the difference between these two:

bin: https://www.apache.org/dyn/closer.lua/hbase/2.5.8/hbase-2.5.8-bin.tar.gz
hadoop3-bin: https://www.apache.org/dyn/closer.lua/hbase/2.5.8/hbase-2.5.8-hadoop3-bin.tar.gz

I can't find a description of this anywhere. The same applies to the client libraries client-bin and hadoop3-client-bin.

Best regards
Udo

> On 30.04.2024 at 04:42, 张铎(Duo Zhang) <palomino...@gmail.com> wrote:
>
> Oh, there is a typo: I mean the ServerCrashProcedure should not block other procedures if it is in the claim replication queue stage.
>
> 张铎(Duo Zhang) <palomino...@gmail.com> wrote on 30.04.2024 at 10:41:
>
>> Sorry to be a pain. The procedure store was a big problem before HBase 2.3, so we did a big refactoring in HBase 2.3+, and there is a migration which makes upgrading a bit complicated.
>>
>> As for the upgrade, you do not need to mix up HBase and Hadoop; you can upgrade them separately. Second, a rolling upgrade is also a bit complicated, so I suggest you try a full shutdown/startup upgrade first; once you have successfully done such an upgrade, you can start to try a rolling upgrade.
>>
>> For your scenario, I suggest you first upgrade Hadoop, including namenode and datanodes; HBase should be functional after that upgrade. Then, as discussed above, turn off the balancer, check the master page to make sure there are no RITs and no procedures, then shut down the master, and then shut down all the region servers. Then start the master (you do not need to wait for the master to finish starting up, as that relies on the meta region being online, for which we must have at least one region server), and then all the region servers, to see if the cluster can go back to normal.
>>
>> As for the ServerCrashProcedure: it is blocked in the claim replication queue stage, which should be blocked other procedures, as the region assignment should have already been finished. Does your cluster have replication peers? If not, it is a bit strange that your procedure is blocked in the claim replication queue stage…
>>
>> Thanks.
>>
>> Udo Offermann <udo.offerm...@zfabrik.de> wrote on 29.04.2024 at 21:26:
>>
>>> This time we made progress.
>>> I first upgraded the master, Hadoop- and HBase-wise (after making sure that there were no regions in transition and no running procedures), while keeping Zookeeper running. The master was started with the new version 2.8.5, reporting that there are 6 nodes with an inconsistent version (which was to be expected). This time the startup process completed with "Starting cluster schema service COMPLETE", all regions were assigned, and the cluster seemed to be stable.
>>>
>>> Again there were no regions in transition and no procedures running, so I started to upgrade the data nodes one by one.
>>> The problem now is that the new region servers are not assigned any regions, except for 3: hbase:namespace, hbase:meta and one of our application-level tables (which is empty most of the time).
>>> The more data nodes I migrated, the more regions accumulated on the nodes running the old version, until the last old data node held all regions except for those 3.
>>>
>>> After all regions had been transitioned, I migrated the last node, whereupon all regions went into transition, looking like this one:
>>>
>>> 2185 2184 WAITING_TIMEOUT seritrack TransitRegionStateProcedure table=tt_items, region=d7a411647663dd9e0fc972c7e14088a5, ASSIGN Mon Apr 29 14:12:36 CEST 2024 Mon Apr 29 14:59:44 CEST 2024 pid=2185, ppid=2184, state=WAITING_TIMEOUT:REGION_STATE_TRANSITION_GET_ASSIGN_CANDIDATE, locked=true; TransitRegionStateProcedure table=tt_items, region=d7a411647663dd9e0fc972c7e14088a5, ASSIGN
>>>
>>> They are all waiting on this one:
>>>
>>> 2184 WAITING seritrack ServerCrashProcedure datanode06ct.gmd9.intern,16020,1714378085579 Mon Apr 29 14:12:36 CEST 2024 Mon Apr 29 14:12:36 CEST 2024 pid=2184, state=WAITING:SERVER_CRASH_CLAIM_REPLICATION_QUEUES, locked=true; ServerCrashProcedure datanode06ct.gmd9.intern,16020,1714378085579, splitWal=true, meta=false
>>>
>>> Again "ServerCrashProcedure"! Why are they not processed?
>>> Why is it so hard to upgrade the cluster? Is it worthwhile to take the next stable version, 2.5.8?
>>> And, by the way, what is the difference between the two distributions "bin" and "hadoop3-bin"?
>>>
>>> Best regards
>>> Udo
>>>
>>>> On 28.04.2024 at 03:03, 张铎(Duo Zhang) <palomino...@gmail.com> wrote:
>>>>
>>>> Better turn it off, and watch the master page until there are no RITs and no other procedures; then call hbase-daemon.sh stop master, and then hbase-daemon.sh stop regionserver.
>>>>
>>>> I'm not 100% sure about the shell commands, you'd better try them yourself. The key here is to stop the master first and make sure there is no procedure left, so we can safely remove the MasterProcWALs, and then stop all region servers.
>>>>
>>>> Thanks.
>>>>
>>>> Udo Offermann <udo.offerm...@zfabrik.de> wrote on 26.04.2024 at 23:34:
>>>>>
>>>>> I know, but is it necessary or beneficial to turn it off - and if so, when?
>>>>> And what is your recommendation for stopping the region servers? Just
>>>>> hbase-daemon.sh stop regionserver
>>>>> or
>>>>> graceful_stop.sh localhost
>>>>> ?
>>>>>
>>>>>> On 26.04.2024 at 17:22, 张铎(Duo Zhang) <palomino...@gmail.com> wrote:
>>>>>>
>>>>>> Turning off the balancer makes sure that the balancer will not schedule any procedures to balance the cluster.
>>>>>>
>>>>>> Udo Offermann <udo.offerm...@zfabrik.de> wrote on 26.04.2024 at 23:03:
>>>>>>>
>>>>>>> And what about turning off the HBase balancer before stopping the hmaster?
>>>>>>>
>>>>>>>> On 26.04.2024 at 17:00, Udo Offermann <udo.offerm...@zfabrik.de> wrote:
>>>>>>>>
>>>>>>>> So there is no need for
>>>>>>>>
>>>>>>>> hbase/bin/graceful_stop.sh localhost
>>>>>>>>
>>>>>>>> in order to stop the region servers?
>>>>>>>>
>>>>>>>>> On 26.04.2024 at 16:51, 张铎(Duo Zhang) <palomino...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>> The key here is to make sure there are no procedures in HBase, so that we are safe to move the MasterProcWALs.
>>>>>>>>>
>>>>>>>>> And procedures can only be scheduled by the master.
>>>>>>>>>
>>>>>>>>> So once there are no procedures in HBase, you should stop the master first, and then you are free to stop all the region servers. Then you can proceed with the upgrade of hdfs/hadoop, and then restart the master and region servers with the new versions.
>>>>>>>>>
>>>>>>>>> You can have a try.
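Spelled out as commands, that ordering looks roughly like the following. This is only a sketch: the non-interactive shell calls are one way to perform the checks Duo describes, and MasterProcWALs is moved aside rather than deleted, as a precaution.

  # 1) Quiesce: stop balancing, then verify there are no RITs and no
  #    procedures left (check the master web UI as well).
  echo "balance_switch false" | hbase shell -n
  echo "list_procedures" | hbase shell -n

  # 2) Stop the master FIRST, so nothing can schedule new procedures.
  hbase-daemon.sh stop master

  # 3) Only now is it safe to move the procedure WALs out of the way.
  hdfs dfs -mv /hbase/MasterProcWALs /hbase/MasterProcWALs.bak

  # 4) Then stop the region server on each data node.
  hbase-daemon.sh stop regionserver

  # ...switch the binaries, then start the master first and the region
  # servers afterwards.
  hbase-daemon.sh start master
  hbase-daemon.sh start regionserver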
>>>>>>>>>
>>>>>>>>> Udo Offermann <udo.offerm...@zfabrik.de> wrote on 26.04.2024 at 22:47:
>>>>>>>>>>
>>>>>>>>>> Ah, this sounds interesting!
>>>>>>>>>>
>>>>>>>>>> I need to think about how I'm going to manage this together with upgrading Hadoop. My strategy was to first upgrade Hadoop on all machines and then start HBase with the new version on all machines. But now I have to upgrade the master first - Hadoop- and HBase-wise - and then the data nodes one by one - again Hadoop- and HBase-wise. Is it also safe to do the HBase upgrade "inside" a rolling Hadoop upgrade?
>>>>>>>>>>
>>>>>>>>>> I mean:
>>>>>>>>>>
>>>>>>>>>> 1) Upgrade master
>>>>>>>>>>
>>>>>>>>>> make sure there are no hbase procedures running
>>>>>>>>>>
>>>>>>>>>> hdfs dfsadmin -safemode enter
>>>>>>>>>> hdfs dfsadmin -rollingUpgrade prepare
>>>>>>>>>> kill hmaster
>>>>>>>>>> kill/stop zookeeper ???
>>>>>>>>>> hdfs dfs -rm /hbase/MasterProcWALs/*
>>>>>>>>>> stop secondary and namenode
>>>>>>>>>> SWITCH-TO-NEW-VERSION
>>>>>>>>>> hadoop-daemon.sh start namenode -rollingUpgrade started
>>>>>>>>>> start secondary
>>>>>>>>>> start zookeeper
>>>>>>>>>> start hmaster
>>>>>>>>>>
>>>>>>>>>>> The cluster should be in an intermediate state, where master
>>>>>>>>>>> is in new version but region servers remain in old version, but it
>>>>>>>>>>> should be functional.
>>>>>>>>>>
>>>>>>>>>> 2) Upgrade data nodes 1..6
>>>>>>>>>> stop / kill region server ???
>>>>>>>>>> hdfs dfsadmin -shutdownDatanode localhost:50020 upgrade
>>>>>>>>>> SWITCH-TO-NEW-VERSION
>>>>>>>>>> start datanode
>>>>>>>>>> start region server
>>>>>>>>>>
>>>>>>>>>> 3) Finalize upgrade
>>>>>>>>>> hdfs dfsadmin -rollingUpgrade finalize
>>>>>>>>>> start yarn processes
>>>>>>>>>>
>>>>>>>>>> Hmm, sounds like a plan. What do you think?
>>>>>>>>>>
>>>>>>>>>>> On 26.04.2024 at 16:25, 张铎(Duo Zhang) <palomino...@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>> I think the cluster is not in a correct state: none of the SCPs has carryingMeta = true, but meta is not online...
>>>>>>>>>>>
>>>>>>>>>>> If you have gracefully shut down all the region servers, you should not delete all the MasterProcWALs, as there are already SCPs in them. This is how we deal with a graceful shutdown: the master just does not process the SCPs, but the SCPs have already been scheduled...
>>>>>>>>>>>
>>>>>>>>>>> What I said above is: make sure that there are no procedures in the system, then kill the master directly, without shutting down all the region servers, remove the MasterProcWALs, and then restart the master with the new code. The cluster should be in an intermediate state, where the master is on the new version but the region servers remain on the old version, but it should be functional. And then you can rolling-upgrade the region servers one by one.
>>>>>>>>>>>
>>>>>>>>>>> You could try it again.
>>>>>>>>>>>
>>>>>>>>>>> Thanks.
>>>>>>>>>>>
>>>>>>>>>>> Udo Offermann <udo.offerm...@zfabrik.de> wrote on 26.04.2024 at 22:03:
>>>>>>>>>>>>
>>>>>>>>>>>>> I think hostnames should be case insensitive? So why is there a
>>>>>>>>>>>>> 'DATANODE01CT' and then a 'DATANODE01ct'?
>>>>>>>>>>>>
>>>>>>>>>>>> Well observed ;-) I was asked by our customer to disguise the server names, and I missed some of them when searching and replacing, but I can assure you that all server names are correct and we have never had any problems with them.
>>>>>>>>>>>>
>>>>>>>>>>>> The cluster consists of 7 servers: one master and 6 data nodes running on Alma Linux (version 8, I believe) and Java 8 (updated only some weeks ago). The master runs the Hadoop namenode, secondary namenode, YARN resource manager and history server, as well as the HBase Zookeeper and Master. The data nodes each run a datanode, a region server and a YARN node manager. They're all virtual machines of the same size, RAM- (16GB) and CPU-wise (4 cores). The basic setup is from 2015 (with HBase 0.9, and we never changed it except for upgrading to HBase 1.0 and to HBase 2.2.5 in 2020); thus we have been running Hadoop/HBase for almost 10 years now without any major problems.
>>>>>>>>>>>>
>>>>>>>>>>>> The HBCKServerCrashProcedure comes from my attempt to recover the cluster, as you advised me the other day:
>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Then use HBCK2 to schedule a SCP for this region server, to see if it can fix the problem.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> https://github.com/apache/hbase-operator-tools/blob/master/hbase-hbck2/README.md
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> This is the document for HBCK2; you should use the scheduleRecoveries command.
>>>>>>>>>>>>
>>>>>>>>>>>> You can take it as an act of desperation ;-)
>>>>>>>>>>>>
>>>>>>>>>>>> I will take care of log4j2, but how can I get the cluster up and running?
>>>>>>>>>>>>
>>>>>>>>>>>> Best regards
>>>>>>>>>>>> Udo
>>>>>>>>>>>>
>>>>>>>>>>>>> On 26.04.2024 at 15:29, 张铎(Duo Zhang) <palomino...@gmail.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> It is a bit strange: why do you have an HBCKServerCrashProcedure? It should only appear when you use HBCK2 to force-schedule a SCP. And it is also a bit strange that all the SCPs are marked as not carrying meta... How many region servers do you have in your cluster?
>>>>>>>>>>>>>
>>>>>>>>>>>>> I think hostnames should be case insensitive? So why is there a 'DATANODE01CT' and then a 'DATANODE01ct'?
>>>>>>>>>>>>>
>>>>>>>>>>>>> And for hbase 2.5.x, we have switched to log4j2 instead of log4j:
>>>>>>>>>>>>> https://github.com/apache/hbase/blob/branch-2.5/conf/log4j2.properties
>>>>>>>>>>>>>
>>>>>>>>>>>>> Udo Offermann <udo.offerm...@zfabrik.de> wrote on 26.04.2024 at 19:59:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> After resetting the VMs, we started a new upgrade attempt.
>>>>>>>>>>>>>> The Hadoop part ran smoothly again, but we got stuck again with HBase.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Before upgrading HBase I turned off the balancer and stopped all region servers gracefully. I also deleted the MasterProcWALs folder in hdfs.
>>>>>>>>>>>>>> Then I started the master and region servers with version 2.5.7. Again, the master got stuck at the "Starting assignment manager" task.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> There are a number of server crash procedures that do not appear to be processed:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> HBase Shell
>>>>>>>>>>>>>> Use "help" to get list of supported commands.
>>>>>>>>>>>>>> Use "exit" to quit this interactive shell.
>>>>>>>>>>>>>> For Reference, please visit: http://hbase.apache.org/2.0/book.html#shell
>>>>>>>>>>>>>> Version 2.5.7, r6788f98356dd70b4a7ff766ea7a8298e022e7b95, Thu Dec 14 15:59:16 PST 2023
>>>>>>>>>>>>>> Took 0.0016 seconds
>>>>>>>>>>>>>> hbase:001:0> list_procedures
>>>>>>>>>>>>>> PID Name State Submitted Last_Update Parameters
>>>>>>>>>>>>>> 1 org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure RUNNABLE 2024-04-26 12:22:12 +0200 2024-04-26 13:16:57 +0200 [{"state"=>[1, 3]}, {"serverName"=>{"hostName"=>"DATANODE01CT", "port"=>16020, "startCode"=>"1714126714199"}, "carryingMeta"=>false, "shouldSplitWal"=>true}]
>>>>>>>>>>>>>> 2 org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure RUNNABLE 2024-04-26 12:22:18 +0200 2024-04-26 13:16:57 +0200 [{"state"=>[1, 3]}, {"serverName"=>{"hostName"=>"DATANODE02CT", "port"=>16020, "startCode"=>"1714126737220"}, "carryingMeta"=>false, "shouldSplitWal"=>true}]
>>>>>>>>>>>>>> 3 org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure RUNNABLE 2024-04-26 12:22:24 +0200 2024-04-26 13:16:57 +0200 [{"state"=>[1, 3]}, {"serverName"=>{"hostName"=>"DATANODE03CT", "port"=>16020, "startCode"=>"1714126742645"}, "carryingMeta"=>false, "shouldSplitWal"=>true}]
>>>>>>>>>>>>>> 4 org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure RUNNABLE 2024-04-26 12:22:37 +0200 2024-04-26 13:16:57 +0200 [{"state"=>[1, 3]}, {"serverName"=>{"hostName"=>"DATANODE05CT", "port"=>16020, "startCode"=>"1714126754579"}, "carryingMeta"=>false, "shouldSplitWal"=>true}]
>>>>>>>>>>>>>> 5 org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure RUNNABLE 2024-04-26 12:22:44 +0200 2024-04-26 13:16:57 +0200 [{"state"=>[1, 3]}, {"serverName"=>{"hostName"=>"DATANODE06CT", "port"=>16020, "startCode"=>"1714126762089"}, "carryingMeta"=>false, "shouldSplitWal"=>true}]
>>>>>>>>>>>>>> 6 org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure RUNNABLE 2024-04-26 13:13:43 +0200 2024-04-26 13:16:57 +0200 [{"state"=>[1, 3]}, {"serverName"=>{"hostName"=>"DATANODE01ct", "port"=>16020, "startCode"=>"1714127123596"}, "carryingMeta"=>false, "shouldSplitWal"=>true}]
>>>>>>>>>>>>>> 7 org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure RUNNABLE 2024-04-26 13:13:53 +0200 2024-04-26 13:16:57 +0200 [{"state"=>[1, 3]}, {"serverName"=>{"hostName"=>"DATANODE02ct", "port"=>16020, "startCode"=>"1714127133136"}, "carryingMeta"=>false, "shouldSplitWal"=>true}]
>>>>>>>>>>>>>> 8 org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure RUNNABLE 2024-04-26 13:14:07 +0200 2024-04-26 13:16:57 +0200 [{"state"=>[1, 3]}, {"serverName"=>{"hostName"=>"DATANODE03ct", "port"=>16020, "startCode"=>"1714127138682"}, "carryingMeta"=>false, "shouldSplitWal"=>true}]
>>>>>>>>>>>>>> 9 org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure RUNNABLE 2024-04-26 13:14:17 +0200 2024-04-26 13:16:57 +0200 [{"state"=>[1, 3]}, {"serverName"=>{"hostName"=>"DATANODE05ct", "port"=>16020, "startCode"=>"1714127155080"}, "carryingMeta"=>false, "shouldSplitWal"=>true}]
>>>>>>>>>>>>>> 10 org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure RUNNABLE 2024-04-26 13:14:30 +0200 2024-04-26 13:16:57 +0200 [{"state"=>[1, 3]}, {"serverName"=>{"hostName"=>"DATANODE06ct", "port"=>16020, "startCode"=>"1714127158551"}, "carryingMeta"=>false, "shouldSplitWal"=>true}]
>>>>>>>>>>>>>> 11 org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure RUNNABLE 2024-04-26 13:16:57 +0200 2024-04-26 13:16:57 +0200 [{"state"=>[1, 3]}, {"serverName"=>{"hostName"=>"DATANODE04ct", "port"=>16020, "startCode"=>"1714126747741"}, "carryingMeta"=>false, "shouldSplitWal"=>true}]
>>>>>>>>>>>>>> 12 org.apache.hadoop.hbase.master.procedure.HBCKServerCrashProcedure RUNNABLE 2024-04-26 13:22:16 +0200 2024-04-26 13:22:16 +0200 [{"state"=>[1, 3]}, {"serverName"=>{"hostName"=>"DATANODE03CT", "port"=>16020, "startCode"=>"1714130315364"}, "carryingMeta"=>false, "shouldSplitWal"=>true}]
>>>>>>>>>>>>>> 12 row(s)
>>>>>>>>>>>>>> Took 0.6564 seconds
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Strangely enough, the log files are practically empty:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> cat logs/hbase-seritrack-master-server.out
>>>>>>>>>>>>>> 13:31:57.280 [ActiveMasterInitializationMonitor-1714130217278] ERROR org.apache.hadoop.hbase.master.MasterInitializationMonitor - Master failed to complete initialization after 900000ms. Please consider submitting a bug report including a thread dump of this process.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> cat logs/hbase-seritrack-master-server.log
>>>>>>>>>>>>>> Fri Apr 26 13:16:47 CEST 2024 Starting master on master-server
>>>>>>>>>>>>>> core file size (blocks, -c) 0
>>>>>>>>>>>>>> data seg size (kbytes, -d) unlimited
>>>>>>>>>>>>>> scheduling priority (-e) 0
>>>>>>>>>>>>>> file size (blocks, -f) unlimited
>>>>>>>>>>>>>> pending signals (-i) 95119
>>>>>>>>>>>>>> max locked memory (kbytes, -l) 64
>>>>>>>>>>>>>> max memory size (kbytes, -m) unlimited
>>>>>>>>>>>>>> open files (-n) 1024
>>>>>>>>>>>>>> pipe size (512 bytes, -p) 8
>>>>>>>>>>>>>> POSIX message queues (bytes, -q) 819200
>>>>>>>>>>>>>> real-time priority (-r) 0
>>>>>>>>>>>>>> stack size (kbytes, -s) 8192
>>>>>>>>>>>>>> cpu time (seconds, -t) unlimited
>>>>>>>>>>>>>> max user processes (-u) 95119
>>>>>>>>>>>>>> virtual memory (kbytes, -v) unlimited
>>>>>>>>>>>>>> file locks (-x) unlimited
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I have checked the settings:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Submitted Log Name: org.apache.hadoop.hbase
>>>>>>>>>>>>>> Log Class: org.apache.logging.slf4j.Log4jLogger
>>>>>>>>>>>>>> Effective level: ERROR
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I then explicitly set the log level again:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> cat hbase/conf/log4j.properties
>>>>>>>>>>>>>> [...]
>>>>>>>>>>>>>> log4j.logger.org.apache.hadoop.hbase=INFO
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> and
>>>>>>>>>>>>>> export HBASE_ROOT_LOGGER=hbase.root.logger=INFO,console
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> And then restarted the HMaster - without success.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Why does the log level remain at ERROR?
>>>>>>>>>>>>>> I'm pretty sure that the levels will be set to INFO at some point later on, but they remain at level ERROR during the startup phase.
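(Duo's log4j2 pointer above explains this: HBase 2.5.x reads conf/log4j2.properties, so changes to the old log4j.properties are silently ignored. Below is a sketch of the equivalent settings; the logger id "hbase" is an arbitrary name, and whether HBASE_ROOT_LOGGER is still honored should be verified against the log4j2.properties shipped with 2.5:)

  # conf/log4j2.properties - rough equivalent of the old
  # log4j.logger.org.apache.hadoop.hbase=INFO setting:
  #   logger.hbase.name = org.apache.hadoop.hbase
  #   logger.hbase.level = INFO

  # The environment override takes only the value, without the property name:
  export HBASE_ROOT_LOGGER=INFO,console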
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Here is the Zookeeper dump:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> HBase is rooted at /hbase
>>>>>>>>>>>>>> Active master address:
>>>>>>>>>>>>>> master-server,16000,1714130208769
>>>>>>>>>>>>>> Backup master addresses:
>>>>>>>>>>>>>> Region server holding hbase:meta:
>>>>>>>>>>>>>> DATANODE03ct,16020,1714122680513
>>>>>>>>>>>>>> Region servers:
>>>>>>>>>>>>>> DATANODE06ct,16020,1714130693358
>>>>>>>>>>>>>> DATANODE03ct,16020,1714130672936
>>>>>>>>>>>>>> DATANODE02ct,16020,1714130665456
>>>>>>>>>>>>>> DATANODE01ct,16020,1714130653350
>>>>>>>>>>>>>> DATANODE04ct,16020,1714130248620
>>>>>>>>>>>>>> Quorum Server Statistics:
>>>>>>>>>>>>>> master-server:2181
>>>>>>>>>>>>>> stat is not executed because it is not in the whitelist.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> What does one have to do to resolve these server crash procedures?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Best regards
>>>>>>>>>>>>>> Udo
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On 23.04.2024 at 09:36, 张铎(Duo Zhang) <palomino...@gmail.com> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Strange. I checked the code; it seems we get the NPE on this line:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> https://github.com/apache/hbase/blob/4d7ce1aac724fbf09e526fc422b5a11e530c32f0/hbase-server/src/main/java/org/apache/hadoop/hbase/master/MasterRpcServices.java#L2872
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Could you please confirm that you connect to the correct active master, the one which is hanging? It seems that you are connecting to the backup master...
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thanks.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> 张铎(Duo Zhang) <palomino...@gmail.com> wrote on 23.04.2024 at 15:31:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Ah, an NPE usually means a code bug; then there is no simple way to fix it, we need to take a deep look at the code :(
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Sorry.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Udo Offermann <udo.offerm...@zfabrik.de> wrote on 22.04.2024 at 15:32:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Unfortunately not.
>>>>>>>>>>>>>>>>> I've found the node hosting the meta region and was able to run hbck scheduleRecoveries using hbase-operator-tools-1.2.0.
>>>>>>>>>>>>>>>>> The tool however stops with an NPE:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> 09:22:00.532 [main] WARN org.apache.hadoop.util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
>>>>>>>>>>>>>>>>> 09:22:00.703 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - hbase.client.pause.cqtbe is deprecated.
Instead, use hbase.client.pause.server.overloaded >>>>>>>>>>>>>>>>> 09:22:00.765 [ReadOnlyZKClient-HBaseMaster:2181@0x7d9f158f] >>> INFO org.apache.hadoop.hbase.shaded.org.apache.zookeeper.ZooKeeper - >>> Client >>> environment:zookeeper.version=3.8.3-6ad6d364c7c0bcf0de452d54ebefa3058098ab56, >>> built on 2023-10-05 10:34 UTC >>>>>>>>>>>>>>>>> 09:22:00.765 [ReadOnlyZKClient-HBaseMaster:2181@0x7d9f158f] >>> INFO org.apache.hadoop.hbase.shaded.org.apache.zookeeper.ZooKeeper - >>> Client environment:host.name=HBaseMaster.gmd9.intern >>>>>>>>>>>>>>>>> 09:22:00.765 [ReadOnlyZKClient-HBaseMaster:2181@0x7d9f158f] >>> INFO org.apache.hadoop.hbase.shaded.org.apache.zookeeper.ZooKeeper - >>> Client environment:java.version=1.8.0_402 >>>>>>>>>>>>>>>>> 09:22:00.766 [ReadOnlyZKClient-HBaseMaster:2181@0x7d9f158f] >>> INFO org.apache.hadoop.hbase.shaded.org.apache.zookeeper.ZooKeeper - >>> Client environment:java.vendor=Red Hat, Inc. >>>>>>>>>>>>>>>>> 09:22:00.766 [ReadOnlyZKClient-HBaseMaster:2181@0x7d9f158f] >>> INFO org.apache.hadoop.hbase.shaded.org.apache.zookeeper.ZooKeeper - >>> Client >>> environment:java.home=/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.402.b06-2.el8.x86_64/jre >>>>>>>>>>>>>>>>> 09:22:00.766 [ReadOnlyZKClient-HBaseMaster:2181@0x7d9f158f] >>> INFO org.apache.hadoop.hbase.shaded.org.apache.zookeeper.ZooKeeper - >>> Client >>> environment:java.class.path=hbase-operator-tools-1.2.0/hbase-hbck2/hbase-hbck2-1.2.0.jar:hbase/conf:/opt/seritrack/tt/jdk/lib/tools.jar:/opt/seritrack/tt/nosql/hbase:/opt/seritrack/tt/nosql/hbase/lib/shaded-clients/hbase-shaded-mapreduce-2.5.7.jar:/opt/seritrack/tt/nosql/hbase/lib/client-facing-thirdparty/audience-annotations-0.13.0.jar:/opt/seritrack/tt/nosql/hbase/lib/client-facing-thirdparty/commons-logging-1.2.jar:/opt/seritrack/tt/nosql/hbase/lib/client-facing-thirdparty/htrace-core4-4.1.0-incubating.jar:/opt/seritrack/tt/nosql/hbase/lib/client-facing-thirdparty/jcl-over-slf4j-1.7.33.jar:/opt/seritrack/tt/nosql/hbase/lib/client-facing-thirdparty/jul-to-slf4j-1.7.33.jar:/opt/seritrack/tt/nosql/hbase/lib/client-facing-thirdparty/opentelemetry-api-1.15.0.jar:/opt/seritrack/tt/nosql/hbase/lib/client-facing-thirdparty/opentelemetry-context-1.15.0.jar:/opt/seritrack/tt/nosql/hbase/lib/client-facing-thirdparty/opentelemetry-semconv-1.15.0-alpha.jar:/opt/seritrack/tt/nosql/hbase/lib/client-facing-thirdparty/slf4j-api-1.7.33.jar:/opt/seritrack/tt/nosql/hbase/lib/shaded-clients/hbase-shaded-client-2.5.7.jar:/opt/seritrack/tt/nosql/pl_nosql_ext/libs/pl_nosql_ext-3.0.jar:/opt/seritrack/tt/nosql/hbase/lib/client-facing-thirdparty/log4j-1.2-api-2.17.2.jar:/opt/seritrack/tt/nosql/hbase/lib/client-facing-thirdparty/log4j-api-2.17.2.jar:/opt/seritrack/tt/nosql/hbase/lib/client-facing-thirdparty/log4j-core-2.17.2.jar:/opt/seritrack/tt/nosql/hbase/lib/client-facing-thirdparty/log4j-slf4j-impl-2.17.2.jar:/opt/seritrack/tt/prometheus_exporters/jmx_exporter/jmx_prometheus_javaagent.jar >>>>>>>>>>>>>>>>> 09:22:00.766 [ReadOnlyZKClient-HBaseMaster:2181@0x7d9f158f] >>> INFO org.apache.hadoop.hbase.shaded.org.apache.zookeeper.ZooKeeper - >>> Client >>> environment:java.library.path=/opt/seritrack/tt/nosql/hadoop/lib/native >>>>>>>>>>>>>>>>> 09:22:00.766 [ReadOnlyZKClient-HBaseMaster:2181@0x7d9f158f] >>> INFO org.apache.hadoop.hbase.shaded.org.apache.zookeeper.ZooKeeper - >>> Client environment:java.io.tmpdir=/tmp >>>>>>>>>>>>>>>>> 09:22:00.766 [ReadOnlyZKClient-HBaseMaster:2181@0x7d9f158f] >>> INFO org.apache.hadoop.hbase.shaded.org.apache.zookeeper.ZooKeeper - >>> Client 
environment:java.compiler=<NA> >>>>>>>>>>>>>>>>> 09:22:00.766 [ReadOnlyZKClient-HBaseMaster:2181@0x7d9f158f] >>> INFO org.apache.hadoop.hbase.shaded.org.apache.zookeeper.ZooKeeper - >>> Client environment:os.name=Linux >>>>>>>>>>>>>>>>> 09:22:00.766 [ReadOnlyZKClient-HBaseMaster:2181@0x7d9f158f] >>> INFO org.apache.hadoop.hbase.shaded.org.apache.zookeeper.ZooKeeper - >>> Client environment:os.arch=amd64 >>>>>>>>>>>>>>>>> 09:22:00.766 [ReadOnlyZKClient-HBaseMaster:2181@0x7d9f158f] >>> INFO org.apache.hadoop.hbase.shaded.org.apache.zookeeper.ZooKeeper - >>> Client environment:os.version=4.18.0-513.18.1.el8_9.x86_64 >>>>>>>>>>>>>>>>> 09:22:00.766 [ReadOnlyZKClient-HBaseMaster:2181@0x7d9f158f] >>> INFO org.apache.hadoop.hbase.shaded.org.apache.zookeeper.ZooKeeper - >>> Client environment:user.name=seritrack >>>>>>>>>>>>>>>>> 09:22:00.766 [ReadOnlyZKClient-HBaseMaster:2181@0x7d9f158f] >>> INFO org.apache.hadoop.hbase.shaded.org.apache.zookeeper.ZooKeeper - >>> Client environment:user.home=/opt/seritrack >>>>>>>>>>>>>>>>> 09:22:00.766 [ReadOnlyZKClient-HBaseMaster:2181@0x7d9f158f] >>> INFO org.apache.hadoop.hbase.shaded.org.apache.zookeeper.ZooKeeper - >>> Client environment:user.dir=/opt/seritrack/tt/nosql_3.0 >>>>>>>>>>>>>>>>> 09:22:00.766 [ReadOnlyZKClient-HBaseMaster:2181@0x7d9f158f] >>> INFO org.apache.hadoop.hbase.shaded.org.apache.zookeeper.ZooKeeper - >>> Client environment:os.memory.free=275MB >>>>>>>>>>>>>>>>> 09:22:00.766 [ReadOnlyZKClient-HBaseMaster:2181@0x7d9f158f] >>> INFO org.apache.hadoop.hbase.shaded.org.apache.zookeeper.ZooKeeper - >>> Client environment:os.memory.max=2966MB >>>>>>>>>>>>>>>>> 09:22:00.766 [ReadOnlyZKClient-HBaseMaster:2181@0x7d9f158f] >>> INFO org.apache.hadoop.hbase.shaded.org.apache.zookeeper.ZooKeeper - >>> Client environment:os.memory.total=361MB >>>>>>>>>>>>>>>>> 09:22:00.771 [ReadOnlyZKClient-HBaseMaster:2181@0x7d9f158f] >>> INFO org.apache.hadoop.hbase.shaded.org.apache.zookeeper.ZooKeeper - >>> Initiating client connection, connectString=HBaseMaster:2181 >>> sessionTimeout=90000 >>> watcher=org.apache.hadoop.hbase.zookeeper.ReadOnlyZKClient$$Lambda$45/1091799416@aed32c5 >>>>>>>>>>>>>>>>> 09:22:00.774 [ReadOnlyZKClient-HBaseMaster:2181@0x7d9f158f] >>> INFO org.apache.hadoop.hbase.shaded.org.apache.zookeeper.common.X509Util - >>> Setting -D jdk.tls.rejectClientInitiatedRenegotiation=true to disable >>> client-initiated TLS renegotiation >>>>>>>>>>>>>>>>> 09:22:00.777 [ReadOnlyZKClient-HBaseMaster:2181@0x7d9f158f] >>> INFO org.apache.hadoop.hbase.shaded.org.apache.zookeeper.ClientCnxnSocket >>> - jute.maxbuffer value is 1048575 Bytes >>>>>>>>>>>>>>>>> 09:22:00.785 [ReadOnlyZKClient-HBaseMaster:2181@0x7d9f158f] >>> INFO org.apache.hadoop.hbase.shaded.org.apache.zookeeper.ClientCnxn - >>> zookeeper.request.timeout value is 0. feature enabled=false >>>>>>>>>>>>>>>>> 09:22:00.793 >>> [ReadOnlyZKClient-HBaseMaster:2181@0x7d9f158f-SendThread(HBaseMaster:2181)] >>> INFO org.apache.hadoop.hbase.shaded.org.apache.zookeeper.ClientCnxn - >>> Opening socket connection to server HBaseMaster/10.21.204.230:2181. 
>>>>>>>>>>>>>>>>> 09:22:00.793 >>> [ReadOnlyZKClient-HBaseMaster:2181@0x7d9f158f-SendThread(HBaseMaster:2181)] >>> INFO org.apache.hadoop.hbase.shaded.org.apache.zookeeper.ClientCnxn - SASL >>> config status: Will not attempt to authenticate using SASL (unknown error) >>>>>>>>>>>>>>>>> 09:22:00.797 >>> [ReadOnlyZKClient-HBaseMaster:2181@0x7d9f158f-SendThread(HBaseMaster:2181)] >>> INFO org.apache.hadoop.hbase.shaded.org.apache.zookeeper.ClientCnxn - >>> Socket connection established, initiating session, client: / >>> 10.21.204.230:41072, server: HBaseMaster/10.21.204.230:2181 >>>>>>>>>>>>>>>>> 09:22:00.801 >>> [ReadOnlyZKClient-HBaseMaster:2181@0x7d9f158f-SendThread(HBaseMaster:2181)] >>> INFO org.apache.hadoop.hbase.shaded.org.apache.zookeeper.ClientCnxn - >>> Session establishment complete on server HBaseMaster/10.21.204.230:2181, >>> session id = 0x10009a4f379001e, negotiated timeout = 90000 >>>>>>>>>>>>>>>>> -1 >>>>>>>>>>>>>>>>> Exception in thread "main" java.io.IOException: >>> org.apache.hbase.thirdparty.com.google.protobuf.ServiceException: >>> org.apache.hadoop.hbase.ipc.RemoteWithExtrasException(java.io.IOException): >>> java.io.IOException >>>>>>>>>>>>>>>>> at >>> org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:479) >>>>>>>>>>>>>>>>> at >>> org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:124) >>>>>>>>>>>>>>>>> at >>> org.apache.hadoop.hbase.ipc.RpcHandler.run(RpcHandler.java:102) >>>>>>>>>>>>>>>>> at >>> org.apache.hadoop.hbase.ipc.RpcHandler.run(RpcHandler.java:82) >>>>>>>>>>>>>>>>> Caused by: java.lang.NullPointerException >>>>>>>>>>>>>>>>> at >>> org.apache.hadoop.hbase.master.MasterRpcServices.shouldSubmitSCP(MasterRpcServices.java:2872) >>>>>>>>>>>>>>>>> at >>> org.apache.hadoop.hbase.master.MasterRpcServices.scheduleServerCrashProcedure(MasterRpcServices.java:2600) >>>>>>>>>>>>>>>>> at >>> org.apache.hadoop.hbase.shaded.protobuf.generated.MasterProtos$HbckService$2.callBlockingMethod(MasterProtos.java) >>>>>>>>>>>>>>>>> at >>> org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:415) >>>>>>>>>>>>>>>>> ... 
3 more >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> at >>> org.apache.hadoop.hbase.client.HBaseHbck.scheduleServerCrashProcedures(HBaseHbck.java:198) >>>>>>>>>>>>>>>>> at >>> org.apache.hadoop.hbase.client.Hbck.scheduleServerCrashProcedure(Hbck.java:128) >>>>>>>>>>>>>>>>> at >>> org.apache.hbase.HBCK2.scheduleRecoveries(HBCK2.java:418) >>>>>>>>>>>>>>>>> at org.apache.hbase.HBCK2.doCommandLine(HBCK2.java:960) >>>>>>>>>>>>>>>>> at org.apache.hbase.HBCK2.run(HBCK2.java:830) >>>>>>>>>>>>>>>>> at >>> org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76) >>>>>>>>>>>>>>>>> at >>> org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:90) >>>>>>>>>>>>>>>>> at org.apache.hbase.HBCK2.main(HBCK2.java:1145) >>>>>>>>>>>>>>>>> Caused by: >>> org.apache.hbase.thirdparty.com.google.protobuf.ServiceException: >>> org.apache.hadoop.hbase.ipc.RemoteWithExtrasException(java.io.IOException): >>> java.io.IOException >>>>>>>>>>>>>>>>> at >>> org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:479) >>>>>>>>>>>>>>>>> at >>> org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:124) >>>>>>>>>>>>>>>>> at >>> org.apache.hadoop.hbase.ipc.RpcHandler.run(RpcHandler.java:102) >>>>>>>>>>>>>>>>> at >>> org.apache.hadoop.hbase.ipc.RpcHandler.run(RpcHandler.java:82) >>>>>>>>>>>>>>>>> Caused by: java.lang.NullPointerException >>>>>>>>>>>>>>>>> at >>> org.apache.hadoop.hbase.master.MasterRpcServices.shouldSubmitSCP(MasterRpcServices.java:2872) >>>>>>>>>>>>>>>>> at >>> org.apache.hadoop.hbase.master.MasterRpcServices.scheduleServerCrashProcedure(MasterRpcServices.java:2600) >>>>>>>>>>>>>>>>> at >>> org.apache.hadoop.hbase.shaded.protobuf.generated.MasterProtos$HbckService$2.callBlockingMethod(MasterProtos.java) >>>>>>>>>>>>>>>>> at >>> org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:415) >>>>>>>>>>>>>>>>> ... 3 more >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> at >>> org.apache.hadoop.hbase.ipc.AbstractRpcClient.callBlockingMethod(AbstractRpcClient.java:340) >>>>>>>>>>>>>>>>> at >>> org.apache.hadoop.hbase.ipc.AbstractRpcClient.access$200(AbstractRpcClient.java:92) >>>>>>>>>>>>>>>>> at >>> org.apache.hadoop.hbase.ipc.AbstractRpcClient$BlockingRpcChannelImplementation.callBlockingMethod(AbstractRpcClient.java:595) >>>>>>>>>>>>>>>>> at >>> org.apache.hadoop.hbase.shaded.protobuf.generated.MasterProtos$HbckService$BlockingStub.scheduleServerCrashProcedure(MasterProtos.java) >>>>>>>>>>>>>>>>> at >>> org.apache.hadoop.hbase.client.HBaseHbck.scheduleServerCrashProcedures(HBaseHbck.java:190) >>>>>>>>>>>>>>>>> ... 
7 more >>>>>>>>>>>>>>>>> Caused by: >>> org.apache.hadoop.hbase.ipc.RemoteWithExtrasException(java.io.IOException): >>> java.io.IOException >>>>>>>>>>>>>>>>> at >>> org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:479) >>>>>>>>>>>>>>>>> at >>> org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:124) >>>>>>>>>>>>>>>>> at >>> org.apache.hadoop.hbase.ipc.RpcHandler.run(RpcHandler.java:102) >>>>>>>>>>>>>>>>> at >>> org.apache.hadoop.hbase.ipc.RpcHandler.run(RpcHandler.java:82) >>>>>>>>>>>>>>>>> Caused by: java.lang.NullPointerException >>>>>>>>>>>>>>>>> at >>> org.apache.hadoop.hbase.master.MasterRpcServices.shouldSubmitSCP(MasterRpcServices.java:2872) >>>>>>>>>>>>>>>>> at >>> org.apache.hadoop.hbase.master.MasterRpcServices.scheduleServerCrashProcedure(MasterRpcServices.java:2600) >>>>>>>>>>>>>>>>> at >>> org.apache.hadoop.hbase.shaded.protobuf.generated.MasterProtos$HbckService$2.callBlockingMethod(MasterProtos.java) >>>>>>>>>>>>>>>>> at >>> org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:415) >>>>>>>>>>>>>>>>> ... 3 more >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> at >>> org.apache.hadoop.hbase.ipc.AbstractRpcClient.onCallFinished(AbstractRpcClient.java:388) >>>>>>>>>>>>>>>>> at >>> org.apache.hadoop.hbase.ipc.AbstractRpcClient.access$100(AbstractRpcClient.java:92) >>>>>>>>>>>>>>>>> at >>> org.apache.hadoop.hbase.ipc.AbstractRpcClient$3.run(AbstractRpcClient.java:425) >>>>>>>>>>>>>>>>> at >>> org.apache.hadoop.hbase.ipc.AbstractRpcClient$3.run(AbstractRpcClient.java:420) >>>>>>>>>>>>>>>>> at >>> org.apache.hadoop.hbase.ipc.Call.callComplete(Call.java:114) >>>>>>>>>>>>>>>>> at >>> org.apache.hadoop.hbase.ipc.Call.setException(Call.java:129) >>>>>>>>>>>>>>>>> at >>> org.apache.hadoop.hbase.ipc.NettyRpcDuplexHandler.readResponse(NettyRpcDuplexHandler.java:199) >>>>>>>>>>>>>>>>> at >>> org.apache.hadoop.hbase.ipc.NettyRpcDuplexHandler.channelRead(NettyRpcDuplexHandler.java:220) >>>>>>>>>>>>>>>>> at >>> org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:442) >>>>>>>>>>>>>>>>> at >>> org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420) >>>>>>>>>>>>>>>>> at >>> org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412) >>>>>>>>>>>>>>>>> at >>> org.apache.hbase.thirdparty.io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:346) >>>>>>>>>>>>>>>>> at >>> org.apache.hbase.thirdparty.io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:318) >>>>>>>>>>>>>>>>> at >>> org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:444) >>>>>>>>>>>>>>>>> at >>> org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420) >>>>>>>>>>>>>>>>> at >>> org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412) >>>>>>>>>>>>>>>>> at >>> org.apache.hbase.thirdparty.io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:286) >>>>>>>>>>>>>>>>> at >>> org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:442) >>>>>>>>>>>>>>>>> at >>> 
org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
>>>>>>>>>>>>>>>>> at org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412)
>>>>>>>>>>>>>>>>> at org.apache.hbase.thirdparty.io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1410)
>>>>>>>>>>>>>>>>> at org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:440)
>>>>>>>>>>>>>>>>> at org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
>>>>>>>>>>>>>>>>> at org.apache.hbase.thirdparty.io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:919)
>>>>>>>>>>>>>>>>> at org.apache.hbase.thirdparty.io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:166)
>>>>>>>>>>>>>>>>> at org.apache.hbase.thirdparty.io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:788)
>>>>>>>>>>>>>>>>> at org.apache.hbase.thirdparty.io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:724)
>>>>>>>>>>>>>>>>> at org.apache.hbase.thirdparty.io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:650)
>>>>>>>>>>>>>>>>> at org.apache.hbase.thirdparty.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:562)
>>>>>>>>>>>>>>>>> at org.apache.hbase.thirdparty.io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997)
>>>>>>>>>>>>>>>>> at org.apache.hbase.thirdparty.io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
>>>>>>>>>>>>>>>>> at org.apache.hbase.thirdparty.io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
>>>>>>>>>>>>>>>>> at java.lang.Thread.run(Thread.java:750)
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On 20.04.2024 at 15:53, 张铎(Duo Zhang) <palomino...@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> OK, it was waitForMetaOnline.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Maybe the problem is that you did have some legitimate procedures before upgrading, like ServerCrashProcedures, but then you deleted all the procedure WALs, so those ServerCrashProcedures are gone too, and meta can never come online.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Please check the /hbase/meta-region-server znode on zookeeper and dump its content. It is protobuf based, but you can still see the encoded server name which hosts the meta region.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Then use HBCK2 to schedule a SCP for this region server, to see if it can fix the problem.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> https://github.com/apache/hbase-operator-tools/blob/master/hbase-hbck2/README.md
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> This is the document for HBCK2; you should use the scheduleRecoveries command.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Hope this could fix your problem.
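(In command form, the two steps above might look like this; a sketch only: piping into "hbase zkcli" is one way to read the znode, the jar path follows the hbase-operator-tools-1.2.0 layout visible in the classpath earlier in this thread, and the server name is the hbase:meta holder from the Zookeeper dump above:)

  # Read the meta location znode; it is protobuf-framed, but the
  # server name is legible in the output.
  echo 'get /hbase/meta-region-server' | hbase zkcli

  # Schedule a ServerCrashProcedure for that server via HBCK2.
  hbase hbck -j hbase-operator-tools-1.2.0/hbase-hbck2/hbase-hbck2-1.2.0.jar \
    scheduleRecoveries DATANODE03ct,16020,1714122680513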
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Thread 92 (master/masterserver:16000:becomeActiveMaster):
>>>>>>>>>>>>>>>>>> State: TIMED_WAITING
>>>>>>>>>>>>>>>>>> Blocked count: 165
>>>>>>>>>>>>>>>>>> Waited count: 404
>>>>>>>>>>>>>>>>>> Stack:
>>>>>>>>>>>>>>>>>> java.lang.Thread.sleep(Native Method)
>>>>>>>>>>>>>>>>>> org.apache.hadoop.hbase.util.Threads.sleep(Threads.java:125)
>>>>>>>>>>>>>>>>>> org.apache.hadoop.hbase.master.HMaster.isRegionOnline(HMaster.java:1358)
>>>>>>>>>>>>>>>>>> org.apache.hadoop.hbase.master.HMaster.waitForMetaOnline(HMaster.java:1328)
>>>>>>>>>>>>>>>>>> org.apache.hadoop.hbase.master.HMaster.finishActiveMasterInitialization(HMaster.java:1069)
>>>>>>>>>>>>>>>>>> org.apache.hadoop.hbase.master.HMaster.startActiveMasterManager(HMaster.java:2405)
>>>>>>>>>>>>>>>>>> org.apache.hadoop.hbase.master.HMaster.lambda$null$0(HMaster.java:565)
>>>>>>>>>>>>>>>>>> org.apache.hadoop.hbase.master.HMaster$$Lambda$265/1598878738.run(Unknown Source)
>>>>>>>>>>>>>>>>>> org.apache.hadoop.hbase.trace.TraceUtil.trace(TraceUtil.java:187)
>>>>>>>>>>>>>>>>>> org.apache.hadoop.hbase.trace.TraceUtil.trace(TraceUtil.java:177)
>>>>>>>>>>>>>>>>>> org.apache.hadoop.hbase.master.HMaster.lambda$run$1(HMaster.java:562)
>>>>>>>>>>>>>>>>>> org.apache.hadoop.hbase.master.HMaster$$Lambda$264/1129144214.run(Unknown Source)
>>>>>>>>>>>>>>>>>> java.lang.Thread.run(Thread.java:750)
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Udo Offermann <udo.offerm...@zfabrik.de> wrote on 20.04.2024 at 21:13:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Master status for masterserver.gmd9.intern,16000,1713515965162 as of Fri Apr 19 10:55:22 CEST 2024
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Version Info:
>>>>>>>>>>>>>>>>>>> ===========================================================
>>>>>>>>>>>>>>>>>>> HBase 2.5.7
>>>>>>>>>>>>>>>>>>> Source code repository git://buildbox.localdomain/home/apurtell/tmp/RM/hbase revision=6788f98356dd70b4a7ff766ea7a8298e022e7b95
>>>>>>>>>>>>>>>>>>> Compiled by apurtell on Thu Dec 14 15:59:16 PST 2023
>>>>>>>>>>>>>>>>>>> From source with checksum 1501d7fdf72398791ee335a229d099fc972cea7c2a952da7622eb087ddf975361f107cbbbee5d0ad6f603466e9afa1f4fd242ffccbd4371eb0b56059bb3b5402
>>>>>>>>>>>>>>>>>>> Hadoop 2.10.2
>>>>>>>>>>>>>>>>>>> Source code repository Unknown revision=965fd380006fa78b2315668fbc7eb432e1d8200f
>>>>>>>>>>>>>>>>>>> Compiled by ubuntu on 2022-05-25T00:12Z
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Tasks:
>>>>>>>>>>>>>>>>>>> ===========================================================
>>>>>>>>>>>>>>>>>>> Task: Master startup
>>>>>>>>>>>>>>>>>>> Status: RUNNING:Starting assignment manager
>>>>>>>>>>>>>>>>>>> Running for 954s
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Task: Flushing master:store,,1.1595e783b53d99cd5eef43b6debb2682.
>>>>>>>>>>>>>>>>>>> Status: COMPLETE:Flush successful flush >>> result:CANNOT_FLUSH_MEMSTORE_EMPTY, >>>>>>>>>>>>>>>>>>> failureReason:Nothing to flush,flush seq id14 >>>>>>>>>>>>>>>>>>> Completed 49s ago >>>>>>>>>>>>>>>>>>> Ran for 0s >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> Task: >>> RpcServer.priority.RWQ.Fifo.write.handler=0,queue=0,port=16000 >>>>>>>>>>>>>>>>>>> Status: WAITING:Waiting for a call >>>>>>>>>>>>>>>>>>> Running for 951s >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> Task: >>> RpcServer.priority.RWQ.Fifo.write.handler=1,queue=0,port=16000 >>>>>>>>>>>>>>>>>>> Status: WAITING:Waiting for a call >>>>>>>>>>>>>>>>>>> Running for 951s >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> Servers: >>>>>>>>>>>>>>>>>>> >>> =========================================================== >>>>>>>>>>>>>>>>>>> servername1ct.gmd9.intern,16020,1713514863737: >>> requestsPerSecond=0.0, >>>>>>>>>>>>>>>>>>> numberOfOnlineRegions=0, usedHeapMB=37.0MB, >>> maxHeapMB=2966.0MB, >>>>>>>>>>>>>>>>>>> numberOfStores=0, numberOfStorefiles=0, storeRefCount=0, >>>>>>>>>>>>>>>>>>> maxCompactedStoreFileRefCount=0, >>> storefileUncompressedSizeMB=0, >>>>>>>>>>>>>>>>>>> storefileSizeMB=0, memstoreSizeMB=0, readRequestsCount=0, >>>>>>>>>>>>>>>>>>> filteredReadRequestsCount=0, writeRequestsCount=0, >>> rootIndexSizeKB=0, >>>>>>>>>>>>>>>>>>> totalStaticIndexSizeKB=0, totalStaticBloomSizeKB=0, >>> totalCompactingKVs=0, >>>>>>>>>>>>>>>>>>> currentCompactedKVs=0, compactionProgressPct=NaN, >>> coprocessors=[] >>>>>>>>>>>>>>>>>>> servername2ct.gmd9.intern,16020,1713514925960: >>> requestsPerSecond=0.0, >>>>>>>>>>>>>>>>>>> numberOfOnlineRegions=0, usedHeapMB=20.0MB, >>> maxHeapMB=2966.0MB, >>>>>>>>>>>>>>>>>>> numberOfStores=0, numberOfStorefiles=0, storeRefCount=0, >>>>>>>>>>>>>>>>>>> maxCompactedStoreFileRefCount=0, >>> storefileUncompressedSizeMB=0, >>>>>>>>>>>>>>>>>>> storefileSizeMB=0, memstoreSizeMB=0, readRequestsCount=0, >>>>>>>>>>>>>>>>>>> filteredReadRequestsCount=0, writeRequestsCount=0, >>> rootIndexSizeKB=0, >>>>>>>>>>>>>>>>>>> totalStaticIndexSizeKB=0, totalStaticBloomSizeKB=0, >>> totalCompactingKVs=0, >>>>>>>>>>>>>>>>>>> currentCompactedKVs=0, compactionProgressPct=NaN, >>> coprocessors=[] >>>>>>>>>>>>>>>>>>> servername3ct.gmd9.intern,16020,1713514937151: >>> requestsPerSecond=0.0, >>>>>>>>>>>>>>>>>>> numberOfOnlineRegions=0, usedHeapMB=67.0MB, >>> maxHeapMB=2966.0MB, >>>>>>>>>>>>>>>>>>> numberOfStores=0, numberOfStorefiles=0, storeRefCount=0, >>>>>>>>>>>>>>>>>>> maxCompactedStoreFileRefCount=0, >>> storefileUncompressedSizeMB=0, >>>>>>>>>>>>>>>>>>> storefileSizeMB=0, memstoreSizeMB=0, readRequestsCount=0, >>>>>>>>>>>>>>>>>>> filteredReadRequestsCount=0, writeRequestsCount=0, >>> rootIndexSizeKB=0, >>>>>>>>>>>>>>>>>>> totalStaticIndexSizeKB=0, totalStaticBloomSizeKB=0, >>> totalCompactingKVs=0, >>>>>>>>>>>>>>>>>>> currentCompactedKVs=0, compactionProgressPct=NaN, >>> coprocessors=[] >>>>>>>>>>>>>>>>>>> servername4ct.gmd9.intern,16020,1713514968019: >>> requestsPerSecond=0.0, >>>>>>>>>>>>>>>>>>> numberOfOnlineRegions=0, usedHeapMB=24.0MB, >>> maxHeapMB=2966.0MB, >>>>>>>>>>>>>>>>>>> numberOfStores=0, numberOfStorefiles=0, storeRefCount=0, >>>>>>>>>>>>>>>>>>> maxCompactedStoreFileRefCount=0, >>> storefileUncompressedSizeMB=0, >>>>>>>>>>>>>>>>>>> storefileSizeMB=0, memstoreSizeMB=0, readRequestsCount=0, >>>>>>>>>>>>>>>>>>> filteredReadRequestsCount=0, writeRequestsCount=0, >>> rootIndexSizeKB=0, >>>>>>>>>>>>>>>>>>> totalStaticIndexSizeKB=0, totalStaticBloomSizeKB=0, >>> 
totalCompactingKVs=0, >>>>>>>>>>>>>>>>>>> currentCompactedKVs=0, compactionProgressPct=NaN, >>> coprocessors=[] >>>>>>>>>>>>>>>>>>> servername5ct.gmd9.intern,16020,1713514979294: >>> requestsPerSecond=0.0, >>>>>>>>>>>>>>>>>>> numberOfOnlineRegions=0, usedHeapMB=58.0MB, >>> maxHeapMB=2966.0MB, >>>>>>>>>>>>>>>>>>> numberOfStores=0, numberOfStorefiles=0, storeRefCount=0, >>>>>>>>>>>>>>>>>>> maxCompactedStoreFileRefCount=0, >>> storefileUncompressedSizeMB=0, >>>>>>>>>>>>>>>>>>> storefileSizeMB=0, memstoreSizeMB=0, readRequestsCount=0, >>>>>>>>>>>>>>>>>>> filteredReadRequestsCount=0, writeRequestsCount=0, >>> rootIndexSizeKB=0, >>>>>>>>>>>>>>>>>>> totalStaticIndexSizeKB=0, totalStaticBloomSizeKB=0, >>> totalCompactingKVs=0, >>>>>>>>>>>>>>>>>>> currentCompactedKVs=0, compactionProgressPct=NaN, >>> coprocessors=[] >>>>>>>>>>>>>>>>>>> servername6ct.gmd9.intern,16020,1713514994770: >>> requestsPerSecond=0.0, >>>>>>>>>>>>>>>>>>> numberOfOnlineRegions=0, usedHeapMB=31.0MB, >>> maxHeapMB=2966.0MB, >>>>>>>>>>>>>>>>>>> numberOfStores=0, numberOfStorefiles=0, storeRefCount=0, >>>>>>>>>>>>>>>>>>> maxCompactedStoreFileRefCount=0, >>> storefileUncompressedSizeMB=0, >>>>>>>>>>>>>>>>>>> storefileSizeMB=0, memstoreSizeMB=0, readRequestsCount=0, >>>>>>>>>>>>>>>>>>> filteredReadRequestsCount=0, writeRequestsCount=0, >>> rootIndexSizeKB=0, >>>>>>>>>>>>>>>>>>> totalStaticIndexSizeKB=0, totalStaticBloomSizeKB=0, >>> totalCompactingKVs=0, >>>>>>>>>>>>>>>>>>> currentCompactedKVs=0, compactionProgressPct=NaN, >>> coprocessors=[] >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> Regions-in-transition: >>>>>>>>>>>>>>>>>>> >>> =========================================================== >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> Executors: >>>>>>>>>>>>>>>>>>> >>> =========================================================== >>>>>>>>>>>>>>>>>>> Status for executor: >>>>>>>>>>>>>>>>>>> >>> Executor-4-MASTER_META_SERVER_OPERATIONS-master/masterserver:16000 >>>>>>>>>>>>>>>>>>> ======================================= >>>>>>>>>>>>>>>>>>> 0 events queued, 0 running >>>>>>>>>>>>>>>>>>> Status for executor: >>>>>>>>>>>>>>>>>>> >>> Executor-6-MASTER_SNAPSHOT_OPERATIONS-master/masterserver:16000 >>>>>>>>>>>>>>>>>>> ======================================= >>>>>>>>>>>>>>>>>>> 0 events queued, 0 running >>>>>>>>>>>>>>>>>>> Status for executor: >>>>>>>>>>>>>>>>>>> >>> Executor-3-MASTER_SERVER_OPERATIONS-master/masterserver:16000 >>>>>>>>>>>>>>>>>>> ======================================= >>>>>>>>>>>>>>>>>>> 0 events queued, 0 running >>>>>>>>>>>>>>>>>>> Status for executor: >>> Executor-5-M_LOG_REPLAY_OPS-master/masterserver:16000 >>>>>>>>>>>>>>>>>>> ======================================= >>>>>>>>>>>>>>>>>>> 0 events queued, 0 running >>>>>>>>>>>>>>>>>>> Status for executor: >>>>>>>>>>>>>>>>>>> Executor-2-MASTER_CLOSE_REGION-master/masterserver:16000 >>>>>>>>>>>>>>>>>>> ======================================= >>>>>>>>>>>>>>>>>>> 0 events queued, 0 running >>>>>>>>>>>>>>>>>>> Status for executor: >>>>>>>>>>>>>>>>>>> >>> Executor-7-MASTER_MERGE_OPERATIONS-master/masterserver:16000 >>>>>>>>>>>>>>>>>>> ======================================= >>>>>>>>>>>>>>>>>>> 0 events queued, 0 running >>>>>>>>>>>>>>>>>>> Status for executor: >>>>>>>>>>>>>>>>>>> >>> Executor-8-MASTER_TABLE_OPERATIONS-master/masterserver:16000 >>>>>>>>>>>>>>>>>>> ======================================= >>>>>>>>>>>>>>>>>>> 0 events queued, 0 running >>>>>>>>>>>>>>>>>>> Status for executor: >>>>>>>>>>>>>>>>>>> 
Executor-1-MASTER_OPEN_REGION-master/masterserver:16000
=======================================
0 events queued, 0 running

Stacks:
===========================================================
Process Thread Dump:
131 active threads
Thread 186 (WAL-Archive-0):
  State: WAITING
  Blocked count: 5
  Waited count: 11
  Waiting on java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@42f44d41
  Stack:
    sun.misc.Unsafe.park(Native Method)
    java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
    java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
    java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
    java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1074)
    java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1134)
    java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    java.lang.Thread.run(Thread.java:750)
Thread 185 (Close-WAL-Writer-0):
  State: TIMED_WAITING
  Blocked count: 2
  Waited count: 6
  Stack:
    sun.misc.Unsafe.park(Native Method)
    java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
    java.util.concurrent.SynchronousQueue$TransferStack.awaitFulfill(SynchronousQueue.java:460)
    java.util.concurrent.SynchronousQueue$TransferStack.transfer(SynchronousQueue.java:362)
    java.util.concurrent.SynchronousQueue.poll(SynchronousQueue.java:941)
    java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1073)
    java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1134)
    java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    java.lang.Thread.run(Thread.java:750)
Thread 152 (Session-Scheduler-3bc4ef12-1):
  State: TIMED_WAITING
  Blocked count: 0
  Waited count: 1
  Stack:
    sun.misc.Unsafe.park(Native Method)
    java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
    java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2078)
    java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:1093)
    java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:809)
    java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1074)
    java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1134)
    java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    java.lang.Thread.run(Thread.java:750)
Thread 151 (master/masterserver:16000:becomeActiveMaster-HFileCleaner.small.0-1713515973400):
  State: WAITING
  Blocked count: 0
  Waited count: 1
  Waiting on java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@58626ec5
  Stack:
    sun.misc.Unsafe.park(Native Method)
    java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
    java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
    java.util.concurrent.PriorityBlockingQueue.take(PriorityBlockingQueue.java:549)
    org.apache.hadoop.hbase.master.cleaner.HFileCleaner.consumerLoop(HFileCleaner.java:285)
    org.apache.hadoop.hbase.master.cleaner.HFileCleaner$2.run(HFileCleaner.java:269)
Thread 150 (master/masterserver:16000:becomeActiveMaster-HFileCleaner.large.0-1713515973400):
  State: WAITING
  Blocked count: 0
  Waited count: 1
  Waiting on java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@18916420
  Stack:
    sun.misc.Unsafe.park(Native Method)
    java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
    java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
    org.apache.hadoop.hbase.util.StealJobQueue.take(StealJobQueue.java:101)
    org.apache.hadoop.hbase.master.cleaner.HFileCleaner.consumerLoop(HFileCleaner.java:285)
    org.apache.hadoop.hbase.master.cleaner.HFileCleaner$1.run(HFileCleaner.java:254)
Thread 149 (snapshot-hfile-cleaner-cache-refresher):
  State: TIMED_WAITING
  Blocked count: 4
  Waited count: 11
  Stack:
    java.lang.Object.wait(Native Method)
    java.util.TimerThread.mainLoop(Timer.java:552)
    java.util.TimerThread.run(Timer.java:505)
Thread 148 (master/masterserver:16000.Chore.1):
  State: TIMED_WAITING
  Blocked count: 2
  Waited count: 10
  Stack:
    sun.misc.Unsafe.park(Native Method)
    java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
    java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2078)
    java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:1093)
    java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:809)
    java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1074)
    java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1134)
    java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    java.lang.Thread.run(Thread.java:750)
Thread 147 (OldWALsCleaner-1):
  State: WAITING
  Blocked count: 0
  Waited count: 1
  Waiting on java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@7a6a3b7e
  Stack:
    sun.misc.Unsafe.park(Native Method)
    java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
    java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
    java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
    org.apache.hadoop.hbase.master.cleaner.LogCleaner.deleteFile(LogCleaner.java:172)
    org.apache.hadoop.hbase.master.cleaner.LogCleaner.lambda$createOldWalsCleaner$1(LogCleaner.java:152)
    org.apache.hadoop.hbase.master.cleaner.LogCleaner$$Lambda$494/556458560.run(Unknown Source)
    java.lang.Thread.run(Thread.java:750)
Thread 146 (OldWALsCleaner-0):
  State: WAITING
  Blocked count: 0
  Waited count: 1
  Waiting on java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@7a6a3b7e
  Stack:
    sun.misc.Unsafe.park(Native Method)
    java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
    java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
    java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
    org.apache.hadoop.hbase.master.cleaner.LogCleaner.deleteFile(LogCleaner.java:172)
    org.apache.hadoop.hbase.master.cleaner.LogCleaner.lambda$createOldWalsCleaner$1(LogCleaner.java:152)
    org.apache.hadoop.hbase.master.cleaner.LogCleaner$$Lambda$494/556458560.run(Unknown Source)
    java.lang.Thread.run(Thread.java:750)
Thread 139 (PEWorker-16):
  State: TIMED_WAITING
  Blocked count: 0
  Waited count: 16
  Stack:
    sun.misc.Unsafe.park(Native Method)
    java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
    java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2078)
    org.apache.hadoop.hbase.procedure2.AbstractProcedureScheduler.poll(AbstractProcedureScheduler.java:165)
    org.apache.hadoop.hbase.procedure2.AbstractProcedureScheduler.poll(AbstractProcedureScheduler.java:147)
    org.apache.hadoop.hbase.procedure2.ProcedureExecutor$WorkerThread.run(ProcedureExecutor.java:2113)
Thread 138 (PEWorker-15):
  State: TIMED_WAITING
  Blocked count: 0
  Waited count: 16
  Stack:
    sun.misc.Unsafe.park(Native Method)
    java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
    java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2078)
    org.apache.hadoop.hbase.procedure2.AbstractProcedureScheduler.poll(AbstractProcedureScheduler.java:165)
    org.apache.hadoop.hbase.procedure2.AbstractProcedureScheduler.poll(AbstractProcedureScheduler.java:147)
    org.apache.hadoop.hbase.procedure2.ProcedureExecutor$WorkerThread.run(ProcedureExecutor.java:2113)
Thread 137 (PEWorker-14):
  State: TIMED_WAITING
  Blocked count: 0
  Waited count: 16
  Stack:
    sun.misc.Unsafe.park(Native Method)
    java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
    java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2078)
    org.apache.hadoop.hbase.procedure2.AbstractProcedureScheduler.poll(AbstractProcedureScheduler.java:165)
    org.apache.hadoop.hbase.procedure2.AbstractProcedureScheduler.poll(AbstractProcedureScheduler.java:147)
    org.apache.hadoop.hbase.procedure2.ProcedureExecutor$WorkerThread.run(ProcedureExecutor.java:2113)
Thread 136 (PEWorker-13):
  State: TIMED_WAITING
  Blocked count: 0
  Waited count: 16
  Stack:
    sun.misc.Unsafe.park(Native Method)
    java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
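
For reference, a dump like the one above can also be captured from the command line instead of copying it out of the web UI, and the outstanding procedures can be listed from the HBase shell. A minimal sketch, assuming the master info server runs on masterserver:16010 and the default pid file location under /tmp (adjust host, port, and path to your environment):

    # Fetch the master debug dump (the same output as the "Debug dump"
    # link in the master UI)
    curl -s http://masterserver:16010/dump > master-dump.txt

    # Alternatively, take a plain JVM thread dump of the master process;
    # /tmp/hbase-hbase-master.pid is the default pid file naming
    # (hbase-<user>-master.pid), adjust if you changed HBASE_PID_DIR
    jstack "$(cat /tmp/hbase-hbase-master.pid)" > master-threads.txt

    # List pending procedures and the locks they hold, non-interactively
    echo -e "list_procedures\nlist_locks" | hbase shell -n

Note that the PEWorker stacks above are all idle in AbstractProcedureScheduler.poll, which suggests the procedure workers themselves are not stuck; anything pending would be sitting in the scheduler, which is what list_procedures and list_locks should reveal.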