Dear HBase community,

We are on HBase 2.4.17 and are encountering an issue where a region server ends up with a great many of its handlers "stuck": they are shown as active, but nothing is actually happening.
It looks similar to https://issues.apache.org/jira/browse/HBASE-28494, though I am not sure it is the same issue. We have checked our HDFS; it looks fine and is not particularly busy.

Our WAL-related configuration:

  <name>hbase.wal.provider</name>
  <value>multiwal</value>
  <name>hbase.wal.regiongrouping.numgroups</name>
  <value>2</value>

It seems to start with this error:

  2024-11-07 16:31:31,594 ERROR [MemStoreFlusher.7] regionserver.MemStoreFlusher: Cache flush failed for region xxx
  org.apache.hadoop.hbase.exceptions.TimeoutIOException: Failed to get sync result after 300000 ms for txid=802200, WAL system stuck?
          at org.apache.hadoop.hbase.regionserver.wal.SyncFuture.get(SyncFuture.java:171)
          at org.apache.hadoop.hbase.regionserver.wal.AbstractWAL.blockOnSync(AbstractWAL.java:246)
          at org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL.sync(AsyncFSWAL.java:690)

A thread dump taken a short while later looks like this (excerpt):

  Thread 13878: (state = BLOCKED)
   - sun.misc.Unsafe.park(boolean, long) @bci=0 (Compiled frame; information may be imprecise)
   - java.util.concurrent.locks.LockSupport.parkNanos(java.lang.Object, long) @bci=20, line=215 (Compiled frame)
   - java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(long, java.util.concurrent.TimeUnit) @bci=97, line=2163 (Compiled frame)
   - org.apache.hadoop.hbase.regionserver.wal.SyncFuture.get(long) @bci=26, line=169 (Compiled frame)
   - org.apache.hadoop.hbase.regionserver.wal.AbstractWAL.blockOnSync(org.apache.hadoop.hbase.regionserver.wal.SyncFuture) @bci=26, line=246 (Compiled frame)
   - org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL.sync(long, boolean) @bci=110, line=713 (Compiled frame)
   - org.apache.hadoop.hbase.regionserver.HRegion.sync(long, org.apache.hadoop.hbase.client.Durability) @bci=100, line=8729 (Compiled frame)
   - org.apache.hadoop.hbase.regionserver.HRegion.doWALAppend(org.apache.hadoop.hbase.wal.WALEdit, org.apache.hadoop.hbase.client.Durability, java.util.List, long, long, long, long) @bci=213, line=8309 (Compiled frame)

  Thread 14141: (state = BLOCKED)
   - sun.misc.Unsafe.park(boolean, long) @bci=0 (Compiled frame; information may be imprecise)
   - java.util.concurrent.locks.LockSupport.parkNanos(long) @bci=11, line=338 (Compiled frame)
   - com.lmax.disruptor.MultiProducerSequencer.next(int) @bci=82, line=136 (Compiled frame)
   - com.lmax.disruptor.MultiProducerSequencer.next() @bci=2, line=105 (Compiled frame)
   - com.lmax.disruptor.RingBuffer.next() @bci=4, line=263 (Compiled frame)
   - org.apache.hadoop.hbase.regionserver.wal.AbstractWAL.lambda$stampSequenceIdAndPublishToRingBuffer$0(org.apache.commons.lang3.mutable.MutableLong, com.lmax.disruptor.RingBuffer) @bci=2, line=396 (Compiled frame)
   - org.apache.hadoop.hbase.regionserver.wal.AbstractWAL$$Lambda$449.run() @bci=8 (Compiled frame)
   - org.apache.hadoop.hbase.regionserver.MultiVersionConcurrencyControl.begin(java.lang.Runnable) @bci=36, line=144 (Compiled frame)
   - org.apache.hadoop.hbase.regionserver.wal.AbstractWAL.stampSequenceIdAndPublishToRingBuffer(org.apache.hadoop.hbase.client.RegionInfo, org.apache.hadoop.hbase.wal.WALKeyImpl, org.apache.hadoop.hbase.wal.WALEdit, boolean, com.lmax.disruptor.RingBuffer) @bci=61, line=395 (Compiled frame)
   - org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL.append(org.apache.hadoop.hbase.client.RegionInfo, org.apache.hadoop.hbase.wal.WALKeyImpl, org.apache.hadoop.hbase.wal.WALEdit, boolean) @bci=10, line=658 (Compiled frame)

Most handler threads seem to be in one of these two states:

70x:

  Thread 13768: (state = BLOCKED)
   - org.apache.hadoop.hbase.regionserver.MultiVersionConcurrencyControl.complete(org.apache.hadoop.hbase.regionserver.MultiVersionConcurrencyControl$WriteEntry) @bci=6, line=179 (Compiled frame)
   - org.apache.hadoop.hbase.regionserver.HRegion.doWALAppend(org.apache.hadoop.hbase.wal.WALEdit, org.apache.hadoop.hbase.client.Durability, java.util.List, long, long, long, long) @bci=250, line=8314 (Compiled frame)
   - org.apache.hadoop.hbase.regionserver.HRegion.doMiniBatchMutate(org.apache.hadoop.hbase.regionserver.HRegion$BatchOperation) @bci=272, line=4555 (Compiled frame)
   - org.apache.hadoop.hbase.regionserver.HRegion.batchMutate(org.apache.hadoop.hbase.regionserver.HRegion$BatchOperation) @bci=52, line=4479 (Compiled frame)
   - org.apache.hadoop.hbase.regionserver.HRegion.batchMutate(org.apache.hadoop.hbase.client.Mutation[], boolean, long, long) @bci=14, line=4409 (Compiled frame)

136x:

  Thread 13821: (state = BLOCKED)
   - org.apache.hadoop.hbase.regionserver.MultiVersionConcurrencyControl.begin(java.lang.Runnable) @bci=7, line=141 (Interpreted frame)
   - org.apache.hadoop.hbase.regionserver.wal.AbstractWAL.stampSequenceIdAndPublishToRingBuffer(org.apache.hadoop.hbase.client.RegionInfo, org.apache.hadoop.hbase.wal.WALKeyImpl, org.apache.hadoop.hbase.wal.WALEdit, boolean, com.lmax.disruptor.RingBuffer) @bci=61, line=395 (Interpreted frame)
   - org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL.append(org.apache.hadoop.hbase.client.RegionInfo, org.apache.hadoop.hbase.wal.WALKeyImpl, org.apache.hadoop.hbase.wal.WALEdit, boolean) @bci=10, line=658 (Compiled frame)
   - org.apache.hadoop.hbase.regionserver.wal.AbstractFSWAL.appendData(org.apache.hadoop.hbase.client.RegionInfo, org.apache.hadoop.hbase.wal.WALKeyImpl, org.apache.hadoop.hbase.wal.WALEdit) @bci=5, line=991 (Compiled frame)
   - org.apache.hadoop.hbase.regionserver.HRegion.doWALAppend(org.apache.hadoop.hbase.wal.WALEdit, org.apache.hadoop.hbase.client.Durability, java.util.List, long, long, long, long) @bci=195, line=8306 (Compiled frame)

Unfortunately, this situation never recovers on its own, and we must restart the region server.

Please advise on what we should investigate and what we could reconfigure.

Kind regards,
Cornelius
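
P.S. In case anyone wants to reproduce per-state counts like the "70x" and "136x" figures above on their own dump, here is a minimal sketch. It assumes the dump was saved as plain text in the same layout as the excerpt (a "Thread N: (state = ...)" header followed by " - frame" lines) and groups threads by their topmost frame. The function name `top_frame_counts` is purely illustrative and not part of any HBase or JDK tooling.

```python
import re
from collections import Counter

# Header line of one thread, e.g. "Thread 13768: (state = BLOCKED)"
THREAD_HEADER = re.compile(r"^\s*Thread (\d+): \(state = (\w+)\)")
# Frame line, e.g. " - pkg.Class.method(args) @bci=6, line=179 (Compiled frame)"
FRAME = re.compile(r"^\s*- (\S+?)\(")


def top_frame_counts(dump_text):
    """Count threads by (state, method at the top of the stack)."""
    counts = Counter()
    state = None  # state of the thread whose top frame is still pending
    for line in dump_text.splitlines():
        header = THREAD_HEADER.match(line)
        if header:
            state = header.group(2)
            continue
        frame = FRAME.match(line)
        if frame and state is not None:
            counts[(state, frame.group(1))] += 1
            state = None  # skip the deeper frames of this thread
    return counts
```

Calling `top_frame_counts(text).most_common(5)` on the text of a saved dump lists the five most frequent top-of-stack methods together with the thread state.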