Re: Replication Error in HBase Production Environment

Hamado Dene Wed, 18 Sep 2024 00:16:35 -0700

I investigated further, and the WALs seem to be readable without any 
issues. One strange thing I noticed is that the WALs are very old: they are 
about a year older than the current date.


-rw-r--r-- 2 hbase hadoop 42594304 2023-10-09 08:27 /hbase/oldWALs/rzv-db09-hd.xxxx%2C16020%2C1674973354605.1696810476448
-rw-r--r-- 2 hbase hadoop 13622784 2023-10-09 08:26 /hbase/oldWALs/rzv-db10-hd.xxxx%2C16020%2C1674973984596.1696810895708
-rw-r--r-- 2 hbase hadoop    15872 2023-10-09 08:26 /hbase/oldWALs/rzv-db12-hd.xxxx%2C16020%2C1674973371058.1696813278286
-rw-r--r-- 2 hbase hadoop 15007744 2023-10-09 08:26 /hbase/oldWALs/rzv-db13-hd.xxxx%2C16020%2C1684871532555.1696811057371
-rw-r--r-- 2 hbase hadoop 20684288 2023-10-09 08:26 /hbase/oldWALs/rzv-db14-hd.xxxx%2C16020%2C1674973593505.1696810047993

The current date is:
Wed Sep 18 09:06:17 CEST 2024

but the WAL files are dated October 9, 2023.

Could this be the cause of the issue?
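
As a cross-check on those dates, the numeric suffix of each WAL name can be read as a creation time in epoch milliseconds, and the number after the port as the region server start code. The Python sketch below decodes one of the names listed above under that assumption; `parse_wal_name` is a hypothetical helper written for this note, not an HBase API.

```python
from datetime import datetime, timezone
from urllib.parse import unquote


def parse_wal_name(name):
    """Split '<host>%2C<port>%2C<startcode>.<timestamp>' into its parts.

    Assumption: the start code and the trailing number are epoch milliseconds.
    """
    # Everything after the last dot is the WAL's own numeric suffix.
    prefix, _, suffix = name.rpartition(".")
    # The prefix is a URL-encoded "host,port,startcode" triple.
    host, port, startcode = unquote(prefix).rsplit(",", 2)

    def to_utc(millis):
        return datetime.fromtimestamp(int(millis) / 1000, tz=timezone.utc)

    return {
        "host": host,
        "port": int(port),
        "server_started": to_utc(startcode),
        "wal_created": to_utc(suffix),
    }


if __name__ == "__main__":
    info = parse_wal_name("rzv-db09-hd.xxxx%2C16020%2C1674973354605.1696810476448")
    print(info["host"], info["server_started"].date(), info["wal_created"].date())
```

If the assumption holds, the decoded creation date should line up with the modification date shown in the HDFS listing, which would suggest the files really were written in October 2023 rather than carrying a wrong timestamp.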

Hamado Dene 

    On Monday, 16 September 2024 at 16:37:12 CEST, Hamado Dene 
<[email protected]> wrote:  
 
 
I deduced that it was one of the old WALs because, from the UI, I see that 
these old WALs are not being replicated. However, I'll do another round of 
checks to see if I can find something more. Would enabling debug logging help 
me find more information?

Thanks again for your help.


Replication Status

| PeerId | WalGroup | Current Log | Size | Queue Size | Offset |
| replicav3 | rzv-db10-hd.xxxx%2C16020%2C1674973984596 | hdfs://rozzanohadoopcluster/hbase/oldWALs/rzv-db10-hd.xxxx%2C16020%2C1674973984596.1696810895708 | 13.0 M | 1 | -1 |
| replicav3 | rzv-db12-hd.xxxx%2C16020%2C1726056192276 | hdfs://rozzanohadoopcluster/hbase/WALs/rzv-db12-hd.xxxxx,16020,1726056192276/rzv-db12-hd.xxxxxx%2C16020%2C1726056192276.1726495470091 | 0 | 1 | 98.0 M |




Replication Status

| PeerId | WalGroup | Current Log | Size | Queue Size | Offset |
| replicav3 | rzv-db10-hd.xxxx%2C16020%2C1726056520723 | hdfs://rozzanohadoopcluster/hbase/WALs/rzv-db10-hd.xxxx,16020,1726056520723/rzv-db10-hd.rozzano.diennea.lan%2C16020%2C1726056520723.1726495461864 | 0 | 1 | 4.9 M |
| replicav3 | rzv-db14-hd.xxxxn%2C16020%2C1674973593505 | hdfs://rozzanohadoopcluster/hbase/oldWALs/rzv-db14-hd.rozzano.diennea.lan%2C16020%2C1674973593505.1696810047993 | 19.7 M | 1 | -1 |




Replication Status

| PeerId | WalGroup | Current Log | Size | Queue Size | Offset |
| replicav3 | rzv-db11-hd.xxxx%2C16020%2C1726063232272 | hdfs://rozzanohadoopcluster/hbase/WALs/rzv-db11-hd.rozzano.diennea.lan,16020,1726063232272/rzv-db11-hd.rozzano.diennea.lan%2C16020%2C1726063232272.1726495580356 | 0 | 1 | 16.8 M |
| replicav3 | rzv-db12-hd.xxx%2C16020%2C1674973371058 | hdfs://rozzanohadoopcluster/hbase/oldWALs/rzv-db12-hd.rozzano.diennea.lan%2C16020%2C1674973371058.1696813278286 | 15.5 K | 1 | -1 |



Replication Status

| PeerId | WalGroup | Current Log | Size | Queue Size | Offset |
| replicav3 | rzv-db09-hd.xxxx%2C16020%2C1674973354605 | hdfs://rozzanohadoopcluster/hbase/oldWALs/rzv-db09-hd.rozzano.diennea.lan%2C16020%2C1674973354605.1696810476448 | 40.6 M | 1 | -1 |
| replicav3 | rzv-db14-hd.xxx%2C16020%2C1726066551699 | hdfs://rozzanohadoopcluster/hbase/WALs/rzv-db14-hd.rozzano.diennea.lan,16020,1726066551699/rzv-db14-hd.rozzano.diennea.lan%2C16020%2C1726066551699.1726496170126 | 0 | 1 | 7.9 M |

    On Monday, 16 September 2024 at 16:11:19 CEST, 张铎(Duo Zhang) 
<[email protected]> wrote:  
 
 The stack trace you posted is garbled, so it is not easy to find out
which file actually blocks the replication progress...

Could you please double check the WAL file which blocks the
replication? Is it really one of these old WAL files?
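
One way to narrow this down from the warning itself: the reader thread name in the log line appears to carry the WAL group right after `wal-reader.`. A small Python sketch that pulls it out, assuming that naming pattern holds; the regex and the `wal_group_from_log` name are illustrative only, and the sample string is a fragment of the thread name from the warning quoted further down in this thread.

```python
import re
from urllib.parse import unquote

# Capture everything after ".wal-reader." up to the next comma, bracket, or space.
WAL_READER = re.compile(r"\.wal-reader\.([^,\]\s]+)")


def wal_group_from_log(line):
    """Return the URL-decoded WAL group named in a wal-reader thread name, or None."""
    match = WAL_READER.search(line)
    return unquote(match.group(1)) if match else None


if __name__ == "__main__":
    fragment = (
        "replicationSource.wal-reader."
        "rzv-db13-hd.xxxx%2C16020%2C1684871532555,"
        "replicav3-rzv-db13-hd.xxxx,16020,1684871532555"
    )
    print(wal_group_from_log(fragment))
```

The extracted group can then be compared against the WalGroup column of the Replication Status tables and against the file names under /hbase/oldWALs.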

Thanks.

On Mon, 16 Sep 2024 at 21:57, Hamado Dene <[email protected]> wrote:
>
> Thanks for your response.
> If I try to read the WALs with the following command:
> hbase org.apache.hadoop.hbase.wal.WALPrettyPrinter /hbase/oldWALs/rzv-db13-hd.xxxx%2C16020%2C1684871532555.1696811057371
> I don't get any error... The file seems to be read correctly. In fact, at the 
> end of the reading, something like the following is printed:
>
> cell total size sum: 136
> edit heap size: 312
> position: 15007544
>
>
> Thanks,
>
>    On Monday, 16 September 2024 at 14:51:02 CEST, 张铎(Duo Zhang) 
><[email protected]> wrote:
>
>  Have you tried to read these WAL files with WALPrettyPrinter? What error
> does WALPrettyPrinter report while reading these files?
>
> On Mon, 16 Sep 2024 at 16:15, Hamado Dene <[email protected]> wrote:
> >
> > Checking the WALs on HDFS, there are very old WALs, from a year ago... Does 
> > anyone have any idea how to handle this issue in production?
> >
> > -rw-r--r--  2 hbase hadoop  20684288 2023-10-09 08:26 /hbase/oldWALs/rzv-db14-hd.xxxx%2C16020%2C1674973593505.1696810047993
> > -rw-r--r--  2 hbase hadoop  15007744 2023-10-09 08:26 /hbase/oldWALs/rzv-db13-hd.xxxx%2C16020%2C1684871532555.1696811057371
> > -rw-r--r--  2 hbase hadoop     15872 2023-10-09 08:26 /hbase/oldWALs/rzv-db12-hd.xxxx%2C16020%2C1674973371058.1696813278286
> > -rw-r--r--  2 hbase hadoop  42594304 2023-10-09 08:27 /hbase/oldWALs/rzv-db09-hd.xxxx%2C16020%2C1674973354605.1696810476448
> > -rw-r--r--  2 hbase hadoop  13622784 2023-10-09 08:26 /hbase/oldWALs/rzv-db10-hd.xxxx%2C16020%2C1674973984596.1696810895708
> >
> >    On Thursday, 12 September 2024 at 09:30:46 CEST, Hamado Dene 
> ><[email protected]> wrote:
> >
> >  Hi community, could anyone kindly assist me in resolving this issue I'm 
> > facing?
> > Thank you in advance!
> > Hamado Dene
> >    On Wednesday, 11 September 2024 at 16:26:55 CEST, Hamado Dene 
> ><[email protected]> wrote:
> >
> >  Hi HBase Community,
> > We are currently facing an issue in our production environment with HBase 
> > replication, and I would greatly appreciate any guidance or suggestions the 
> > community may have.
> >
> > We are running HBase version 2.5.8, and in the logs, we consistently 
> > encounter the following warning:
> >
> >
> >
> > 2024-09-11T15:51:11,468 WARN  [RS_CLAIM_REPLICATION_QUEUE-regionserver/rzv-db09-hd:16020-0.replicationSource,replicav3-rzv-db13-hd.xxxx,16020,1684871532555-rzv-db09-hd.xxxx,16020,1696832789107-rzv-db09-hd.xxxx,16020,1696833033289-rzv-db13-hd.xxxx,16020,1722636062425-rzv-db13-hd.xxxx,16020,1722636803794-rzv-db12-hd.xxxx,16020,1722636800268.replicationSource.wal-reader.rzv-db13-hd.xxxx%2C16020%2C1684871532555,replicav3-rzv-db13-hd.xxxx,16020,1684871532555-rzv-db09-hd.xxxx,16020,1696832789107-rzv-db09-hd.xxxx,16020,1696833033289-rzv-db13-hd.xxxx,16020,1722636062425-rzv-db13-hd.xxxx,16020,1722636803794-rzv-db12-hd.xxxx,16020,1722636800268] regionserver.ReplicationSourceWALReader: Failed to read stream of replication entries
> > java.io.EOFException: Cannot seek after EOF
> >         at org.apache.hadoop.hdfs.DFSInputStream.seek(DFSInputStream.java:1682) ~[hadoop-hdfs-client-2.10.2.jar:?]
> >         at org.apache.hadoop.fs.FSDataInputStream.seek(FSDataInputStream.java:66) ~[hadoop-common-2.10.2.jar:?]
> >         at org.apache.hadoop.hbase.regionserver.wal.ProtobufLogReader.seekOnFs(ProtobufLogReader.java:527) ~[hbase-server-2.5.8.jar:2.5.8]
> >         at org.apache.hadoop.hbase.regionserver.wal.ReaderBase.seek(ReaderBase.java:130) ~[hbase-server-2.5.8.jar:2.5.8]
> >         at org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.seek(WALEntryStream.java:408) ~[hbase-server-2.5.8.jar:2.5.8]
> >         at org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.openReader(WALEntryStream.java:339) ~[hbase-server-2.5.8.jar:2.5.8]
> >         at org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.openNextLog(WALEntryStream.java:308) ~[hbase-server-2.5.8.jar:2.5.8]
> >         at org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.checkReader(WALEntryStream.java:298) ~[hbase-server-2.5.8.jar:2.5.8]
> >         at org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.tryAdvanceEntry(WALEntryStream.java:172) ~[hbase-server-2.5.8.jar:2.5.8]
> >         at org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.hasNext(WALEntryStream.java:102) ~[hbase-server-2.5.8.jar:2.5.8]
> >         at org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceWALReader.tryAdvanceStreamAndCreateWALBatch(ReplicationSourceWALReader.java:258) ~[hbase-server-2.5.8.jar:2.5.8]
> >         at org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceWALReader.run(ReplicationSourceWALReader.java:145) ~[hbase-server-2.5.8.jar:2.5.8]
> >
> >
> > This error appears to stem from the replication WAL reader, and the "Cannot 
> > seek after EOF" message suggests a failure to read the replication entries. 
> > We suspect this may be affecting the replication flow between our region 
> > servers.
> >
> > Has anyone encountered this problem before, or does anyone have insights 
> > into potential causes and solutions?
> >
> >
> > Thank you in advance for your assistance!
> >
> > Hamado Dene
>
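
On the "Cannot seek after EOF" message in the trace above: Hadoop's `DFSInputStream.seek` raises it when the requested position is greater than the file length. The sketch below only illustrates that condition in Python. The offset that replication would actually resume from is not shown anywhere in this thread, so `saved_offset` is a placeholder; the length and final position are the values reported for the rzv-db13 file.

```python
def seek_would_fail(saved_offset, file_length):
    """Mirror the condition behind the exception: target position beyond end of file."""
    return saved_offset > file_length


if __name__ == "__main__":
    file_length = 15007744    # size of the rzv-db13 WAL in the HDFS listing
    last_position = 15007544  # final position printed by WALPrettyPrinter

    # A position inside the file is fine; anything past the length is not.
    print(seek_would_fail(last_position, file_length))
    print(seek_would_fail(file_length + 1, file_length))
```

Reading the file from the start, as WALPrettyPrinter does, never seeks beyond the end, so a clean read there does not by itself rule out a failing seek to a stored offset.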
