I deduced that it was one of the old WALs because, from the UI, I see that these old WALs are not being replicated. However, I'll do another round of checks to see if I can find something more. Would enabling DEBUG logging help me find more information?
Thanks again for your help.

Replication Status - Current Log - Replication Delay
| PeerId | WalGroup | Current Log | Size | Queue Size | Offset |
| replicav3 | rzv-db10-hd.xxxx%2C16020%2C1674973984596 | hdfs://rozzanohadoopcluster/hbase/oldWALs/rzv-db10-hd.xxxx%2C16020%2C1674973984596.1696810895708 | 13.0 M | 1 | -1 |
| replicav3 | rzv-db12-hd.xxxx%2C16020%2C1726056192276 | hdfs://rozzanohadoopcluster/hbase/WALs/rzv-db12-hd.xxxxx,16020,1726056192276/rzv-db12-hd.xxxxxx%2C16020%2C1726056192276.1726495470091 | 0 | 1 | 98.0 M |

Replication Status - Current Log - Replication Delay
| PeerId | WalGroup | Current Log | Size | Queue Size | Offset |
| replicav3 | rzv-db10-hd.xxxx%2C16020%2C1726056520723 | hdfs://rozzanohadoopcluster/hbase/WALs/rzv-db10-hd.xxxx,16020,1726056520723/rzv-db10-hd.rozzano.diennea.lan%2C16020%2C1726056520723.1726495461864 | 0 | 1 | 4.9 M |
| replicav3 | rzv-db14-hd.xxxxn%2C16020%2C1674973593505 | hdfs://rozzanohadoopcluster/hbase/oldWALs/rzv-db14-hd.rozzano.diennea.lan%2C16020%2C1674973593505.1696810047993 | 19.7 M | 1 | -1 |

Replication Status - Current Log - Replication Delay
| PeerId | WalGroup | Current Log | Size | Queue Size | Offset |
| replicav3 | rzv-db11-hd.xxxx%2C16020%2C1726063232272 | hdfs://rozzanohadoopcluster/hbase/WALs/rzv-db11-hd.rozzano.diennea.lan,16020,1726063232272/rzv-db11-hd.rozzano.diennea.lan%2C16020%2C1726063232272.1726495580356 | 0 | 1 | 16.8 M |
| replicav3 | rzv-db12-hd.xxx%2C16020%2C1674973371058 | hdfs://rozzanohadoopcluster/hbase/oldWALs/rzv-db12-hd.rozzano.diennea.lan%2C16020%2C1674973371058.1696813278286 | 15.5 K | 1 | -1 |

Replication Status - Current Log - Replication Delay
| PeerId | WalGroup | Current Log | Size | Queue Size | Offset |
| replicav3 | rzv-db09-hd.xxxx%2C16020%2C1674973354605 | hdfs://rozzanohadoopcluster/hbase/oldWALs/rzv-db09-hd.rozzano.diennea.lan%2C16020%2C1674973354605.1696810476448 | 40.6 M | 1 | -1 |
| replicav3 | rzv-db14-hd.xxx%2C16020%2C1726066551699 | hdfs://rozzanohadoopcluster/hbase/WALs/rzv-db14-hd.rozzano.diennea.lan,16020,1726066551699/rzv-db14-hd.rozzano.diennea.lan%2C16020%2C1726066551699.1726496170126 | 0 | 1 | 7.9 M |

On Monday, 16 September 2024 at 16:11:19 CEST, 张铎(Duo Zhang) <palomino...@gmail.com> wrote:

The stack trace you posted is messed up, so it is not easy to find out which file actually blocks the replication progress... Could you please double-check which WAL file blocks the replication? Is it really one of these old WAL files?

Thanks.

On Monday, 16 September 2024 at 21:57, Hamado Dene <hamadod...@yahoo.com.invalid> wrote:
> Thanks for your response.
> If I try to read the WALs with the following command:
>
> hbase org.apache.hadoop.hbase.wal.WALPrettyPrinter /hbase/oldWALs/rzv-db13-hd.xxxx%2C16020%2C1684871532555.1696811057371
>
> I don't get any error... The file seems to be read correctly. In fact, at the end of the reading, something like the following is printed:
>
> cell total size sum: 136
> edit heap size: 312
> position: 15007544
>
> Thanks,
>
> On Monday, 16 September 2024 at 14:51:02 CEST, 张铎(Duo Zhang) <palomino...@gmail.com> wrote:
>
> Have you tried to read these WAL files with WALPrettyPrinter? What is the error from WALPrettyPrinter while reading these files?
>
> On Monday, 16 September 2024 at 16:15, Hamado Dene <hamadod...@yahoo.com.invalid> wrote:
> > Checking the WALs on HDFS, there are very old WALs, from a year ago... Does anyone have any idea how to handle this issue in production?
> >
> > -rw-r--r-- 2 hbase hadoop 20684288 2023-10-09 08:26 /hbase/oldWALs/rzv-db14-hd.xxxx%2C16020%2C1674973593505.1696810047993
> > -rw-r--r-- 2 hbase hadoop 15007744 2023-10-09 08:26 /hbase/oldWALs/rzv-db13-hd.xxxx%2C16020%2C1684871532555.1696811057371
> > -rw-r--r-- 2 hbase hadoop 15872 2023-10-09 08:26 /hbase/oldWALs/rzv-db12-hd.xxxx%2C16020%2C1674973371058.1696813278286
> > -rw-r--r-- 2 hbase hadoop 42594304 2023-10-09 08:27 /hbase/oldWALs/rzv-db09-hd.xxxx%2C16020%2C1674973354605.1696810476448
> > -rw-r--r-- 2 hbase hadoop 13622784 2023-10-09 08:26 /hbase/oldWALs/rzv-db10-hd.xxxx%2C16020%2C1674973984596.1696810895708
> >
> > On Thursday, 12 September 2024 at 09:30:46 CEST, Hamado Dene <hamadod...@yahoo.com> wrote:
> >
> > Hi community,
> > Could anyone kindly assist me in resolving this issue I'm facing?
> > Thank you in advance!
> > Hamado Dene
> >
> > On Wednesday, 11 September 2024 at 16:26:55 CEST, Hamado Dene <hamadod...@yahoo.com> wrote:
> >
> > Hi HBase Community,
> > We are currently facing an issue in our production environment with HBase replication, and I would greatly appreciate any guidance or suggestions the community may have.
> >
> > We are running HBase version 2.5.8, and in the logs, we consistently encounter the following warning:
> >
> > 2024-09-11T15:51:11,468 WARN [RS_CLAIM_REPLICATION_QUEUE-regionserver/rzv-db09-hd:16020-0.replicationSource,replicav3-rzv-db13-hd.xxxx,16020,1684871532555-rzv-db09-hd.xxxx,16020,1696832789107-rzv-db09-hd.xxxx,16020,1696833033289-rzv-db13-hd.xxxx,16020,1722636062425-rzv-db13-hd.xxxx,16020,1722636803794-rzv-db12-hd.xxxx,16020,1722636800268.replicationSource.wal-reader.rzv-db13-hd.xxxx%2C16020%2C1684871532555,replicav3-rzv-db13-hd.xxxx,16020,1684871532555-rzv-db09-hd.xxxx,16020,1696832789107-rzv-db09-hd.xxxx,16020,1696833033289-rzv-db13-hd.xxxx,16020,1722636062425-rzv-db13-hd.xxxx,16020,1722636803794-rzv-db12-hd.xxxx,16020,1722636800268] regionserver.ReplicationSourceWALReader: Failed to read stream of replication entries
> > java.io.EOFException: Cannot seek after EOF
> >     at org.apache.hadoop.hdfs.DFSInputStream.seek(DFSInputStream.java:1682) ~[hadoop-hdfs-client-2.10.2.jar:?]
> >     at org.apache.hadoop.fs.FSDataInputStream.seek(FSDataInputStream.java:66) ~[hadoop-common-2.10.2.jar:?]
> >     at org.apache.hadoop.hbase.regionserver.wal.ProtobufLogReader.seekOnFs(ProtobufLogReader.java:527) ~[hbase-server-2.5.8.jar:2.5.8]
> >     at org.apache.hadoop.hbase.regionserver.wal.ReaderBase.seek(ReaderBase.java:130) ~[hbase-server-2.5.8.jar:2.5.8]
> >     at org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.seek(WALEntryStream.java:408) ~[hbase-server-2.5.8.jar:2.5.8]
> >     at org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.openReader(WALEntryStream.java:339) ~[hbase-server-2.5.8.jar:2.5.8]
> >     at org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.openNextLog(WALEntryStream.java:308) ~[hbase-server-2.5.8.jar:2.5.8]
> >     at org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.checkReader(WALEntryStream.java:298) ~[hbase-server-2.5.8.jar:2.5.8]
> >     at org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.tryAdvanceEntry(WALEntryStream.java:172) ~[hbase-server-2.5.8.jar:2.5.8]
> >     at org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.hasNext(WALEntryStream.java:102) ~[hbase-server-2.5.8.jar:2.5.8]
> >     at org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceWALReader.tryAdvanceStreamAndCreateWALBatch(ReplicationSourceWALReader.java:258) ~[hbase-server-2.5.8.jar:2.5.8]
> >     at org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceWALReader.run(ReplicationSourceWALReader.java:145) ~[hbase-server-2.5.8.jar:2.5.8]
> >
> > This error appears to stem from the replication WAL reader, and the "Cannot seek after EOF" message suggests a failure to read the replication entries.
> > We suspect this may be affecting the replication flow between our region servers.
> >
> > Has anyone encountered this problem before, or does anyone have insights into potential causes and solutions?
> >
> > Thank you in advance for your assistance!
> >
> > Hamado Dene
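
For anyone trying to match the oldWALs paths in this thread against the WalGroup column of the replication UI, a small Python sketch may help. It assumes the WAL file name has the form `<URL-encoded host,port,startcode>.<timestamp in ms>`, which is how the names quoted above appear to be built; `parse_wal_name` and `age_in_days` are hypothetical helper names, not part of HBase.

```python
from datetime import datetime, timezone
from typing import NamedTuple
from urllib.parse import unquote


class WalName(NamedTuple):
    host: str
    port: int
    start_code: int  # region server start code, assumed to be ms since epoch
    created_ms: int  # numeric suffix of the WAL file name, assumed ms since epoch


def parse_wal_name(path: str) -> WalName:
    """Split a WAL path into the server name parts and the timestamp suffix."""
    file_name = path.rsplit("/", 1)[-1]
    # The last dot separates the URL-encoded server name from the suffix.
    encoded_server, suffix = file_name.rsplit(".", 1)
    # %2C decodes to a comma: host,port,startcode
    host, port, start_code = unquote(encoded_server).split(",")
    return WalName(host, int(port), int(start_code), int(suffix))


def age_in_days(wal: WalName, now: datetime) -> float:
    """Age of the WAL, measured from the timestamp in its file name."""
    created = datetime.fromtimestamp(wal.created_ms / 1000, tz=timezone.utc)
    return (now - created).total_seconds() / 86400


if __name__ == "__main__":
    # Paths copied from the `ls` listing quoted in this thread.
    paths = [
        "/hbase/oldWALs/rzv-db14-hd.xxxx%2C16020%2C1674973593505.1696810047993",
        "/hbase/oldWALs/rzv-db13-hd.xxxx%2C16020%2C1684871532555.1696811057371",
    ]
    now = datetime(2024, 9, 16, tzinfo=timezone.utc)
    for p in paths:
        wal = parse_wal_name(p)
        print(f"{wal.host} start_code={wal.start_code} "
              f"created_ms={wal.created_ms} age_days={age_in_days(wal, now):.0f}")
```

The decoded host and start code can then be compared with the WalGroup values shown in the UI tables to see which dead region server each stuck queue belongs to. This only interprets file names; it does not read the replication position stored for the queue.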