Dear HBase community,

We are running HBase 2.4.17 and are encountering an issue where a region server
has a great many of its handlers "stuck": they are shown as active, but nothing
is actually happening.

It looks similar to https://issues.apache.org/jira/browse/HBASE-28494, though
I am not sure it is the same issue.
We have checked our HDFS; it looks fine and is not particularly busy.
Our WAL configuration is:

    <property>
      <name>hbase.wal.provider</name>
      <value>multiwal</value>
    </property>
    <property>
      <name>hbase.wal.regiongrouping.numgroups</name>
      <value>2</value>
    </property>

It seems to start with this error:

2024-11-07 16:31:31,594 ERROR [MemStoreFlusher.7] regionserver.MemStoreFlusher: Cache flush failed for region xxx
org.apache.hadoop.hbase.exceptions.TimeoutIOException: Failed to get sync result after 300000 ms for txid=802200, WAL system stuck?
        at org.apache.hadoop.hbase.regionserver.wal.SyncFuture.get(SyncFuture.java:171)
        at org.apache.hadoop.hbase.regionserver.wal.AbstractWAL.blockOnSync(AbstractWAL.java:246)
        at org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL.sync(AsyncFSWAL.java:690)

A thread dump after a short while then looks like this (excerpt):

Thread 13878: (state = BLOCKED)
- sun.misc.Unsafe.park(boolean, long) @bci=0 (Compiled frame; information may be imprecise)
- java.util.concurrent.locks.LockSupport.parkNanos(java.lang.Object, long) @bci=20, line=215 (Compiled frame)
- java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(long, java.util.concurrent.TimeUnit) @bci=97, line=2163 (Compiled frame)
- org.apache.hadoop.hbase.regionserver.wal.SyncFuture.get(long) @bci=26, line=169 (Compiled frame)
- org.apache.hadoop.hbase.regionserver.wal.AbstractWAL.blockOnSync(org.apache.hadoop.hbase.regionserver.wal.SyncFuture) @bci=26, line=246 (Compiled frame)
- org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL.sync(long, boolean) @bci=110, line=713 (Compiled frame)
- org.apache.hadoop.hbase.regionserver.HRegion.sync(long, org.apache.hadoop.hbase.client.Durability) @bci=100, line=8729 (Compiled frame)
- org.apache.hadoop.hbase.regionserver.HRegion.doWALAppend(org.apache.hadoop.hbase.wal.WALEdit, org.apache.hadoop.hbase.client.Durability, java.util.List, long, long, long, long) @bci=213, line=8309 (Compiled frame)

Thread 14141: (state = BLOCKED)
- sun.misc.Unsafe.park(boolean, long) @bci=0 (Compiled frame; information may be imprecise)
- java.util.concurrent.locks.LockSupport.parkNanos(long) @bci=11, line=338 (Compiled frame)
- com.lmax.disruptor.MultiProducerSequencer.next(int) @bci=82, line=136 (Compiled frame)
- com.lmax.disruptor.MultiProducerSequencer.next() @bci=2, line=105 (Compiled frame)
- com.lmax.disruptor.RingBuffer.next() @bci=4, line=263 (Compiled frame)
- org.apache.hadoop.hbase.regionserver.wal.AbstractWAL.lambda$stampSequenceIdAndPublishToRingBuffer$0(org.apache.commons.lang3.mutable.MutableLong, com.lmax.disruptor.RingBuffer) @bci=2, line=396 (Compiled frame)
- org.apache.hadoop.hbase.regionserver.wal.AbstractWAL$$Lambda$449.run() @bci=8 (Compiled frame)
- org.apache.hadoop.hbase.regionserver.MultiVersionConcurrencyControl.begin(java.lang.Runnable) @bci=36, line=144 (Compiled frame)
- org.apache.hadoop.hbase.regionserver.wal.AbstractWAL.stampSequenceIdAndPublishToRingBuffer(org.apache.hadoop.hbase.client.RegionInfo, org.apache.hadoop.hbase.wal.WALKeyImpl, org.apache.hadoop.hbase.wal.WALEdit, boolean, com.lmax.disruptor.RingBuffer) @bci=61, line=395 (Compiled frame)
- org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL.append(org.apache.hadoop.hbase.client.RegionInfo, org.apache.hadoop.hbase.wal.WALKeyImpl, org.apache.hadoop.hbase.wal.WALEdit, boolean) @bci=10, line=658 (Compiled frame)

Most handler threads seem to be in one of these two states:

70x:
Thread 13768: (state = BLOCKED)
- org.apache.hadoop.hbase.regionserver.MultiVersionConcurrencyControl.complete(org.apache.hadoop.hbase.regionserver.MultiVersionConcurrencyControl$WriteEntry) @bci=6, line=179 (Compiled frame)
- org.apache.hadoop.hbase.regionserver.HRegion.doWALAppend(org.apache.hadoop.hbase.wal.WALEdit, org.apache.hadoop.hbase.client.Durability, java.util.List, long, long, long, long) @bci=250, line=8314 (Compiled frame)
- org.apache.hadoop.hbase.regionserver.HRegion.doMiniBatchMutate(org.apache.hadoop.hbase.regionserver.HRegion$BatchOperation) @bci=272, line=4555 (Compiled frame)
- org.apache.hadoop.hbase.regionserver.HRegion.batchMutate(org.apache.hadoop.hbase.regionserver.HRegion$BatchOperation) @bci=52, line=4479 (Compiled frame)
- org.apache.hadoop.hbase.regionserver.HRegion.batchMutate(org.apache.hadoop.hbase.client.Mutation[], boolean, long, long) @bci=14, line=4409 (Compiled frame)

136x:
Thread 13821: (state = BLOCKED)
- org.apache.hadoop.hbase.regionserver.MultiVersionConcurrencyControl.begin(java.lang.Runnable) @bci=7, line=141 (Interpreted frame)
- org.apache.hadoop.hbase.regionserver.wal.AbstractWAL.stampSequenceIdAndPublishToRingBuffer(org.apache.hadoop.hbase.client.RegionInfo, org.apache.hadoop.hbase.wal.WALKeyImpl, org.apache.hadoop.hbase.wal.WALEdit, boolean, com.lmax.disruptor.RingBuffer) @bci=61, line=395 (Interpreted frame)
- org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL.append(org.apache.hadoop.hbase.client.RegionInfo, org.apache.hadoop.hbase.wal.WALKeyImpl, org.apache.hadoop.hbase.wal.WALEdit, boolean) @bci=10, line=658 (Compiled frame)
- org.apache.hadoop.hbase.regionserver.wal.AbstractFSWAL.appendData(org.apache.hadoop.hbase.client.RegionInfo, org.apache.hadoop.hbase.wal.WALKeyImpl, org.apache.hadoop.hbase.wal.WALEdit) @bci=5, line=991 (Compiled frame)
- org.apache.hadoop.hbase.regionserver.HRegion.doWALAppend(org.apache.hadoop.hbase.wal.WALEdit, org.apache.hadoop.hbase.client.Durability, java.util.List, long, long, long, long) @bci=195, line=8306 (Compiled frame)

Unfortunately, this situation never recovers on its own and we must restart the 
region server.

Could you please advise on what we should investigate and what we could reconfigure?

Kind regards
Cornelius

