tomscut created HDFS-16507:
------------------------------

             Summary: Purged edit logs which is in process
                 Key: HDFS-16507
                 URL: https://issues.apache.org/jira/browse/HDFS-16507
             Project: Hadoop HDFS
          Issue Type: Bug
         Environment: {code:java}
// code placeholder
{code}
            Reporter: tomscut


We introduced Standby read functionality in branch-3.1.0, but found a FATAL 
exception. It looks like it's purging edit logs which is in process.

According to the analysis, I suspect that the Editlog to be purged does not 
finalize normally .

I post some key logs for your reference:

1. ANN. Create editlog, 
{color:#FF0000}edits_InProgresS_0000000000024207987{color}.

 
{code:java}
2022-03-15 17:24:52,558 INFO  namenode.FSEditLog 
(FSEditLog.java:startLogSegment(1394)) - Starting log segment at 24207987
2022-03-15 17:24:52,609 INFO  namenode.FSEditLog 
(FSEditLog.java:startLogSegment(1423)) - Ending log segment at 24207987
2022-03-15 17:24:52,610 INFO  namenode.FSEditLog 
(FSEditLog.java:startLogSegmentAndWriteHeaderTxn(1432)) - logEdit at 24207987
2022-03-15 17:24:52,624 INFO  namenode.FSEditLog 
(FSEditLog.java:startLogSegmentAndWriteHeaderTxn(1434)) - logSync at 24207987 
{code}
2. SNN. Checkpoint.

 

{color:#FF0000}25892513 + 1 - 1000000 = 24892514{color}
{color:#FF0000}dfs.namenode.num.extra.edits.retained=1000000{color}

 
{code:java}
2022-03-15 17:28:02,640 INFO  ha.StandbyCheckpointer 
(StandbyCheckpointer.java:doWork(443)) - Triggering checkpoint because there 
have been 1189661 txns since the last checkpoint, which exceeds the configured 
threshold 20000
2022-03-15 17:28:02,648 INFO  namenode.FSImage 
(FSEditLogLoader.java:loadFSEdits(188)) - Edits file 
ByteStringEditLog[27082175, 27082606], ByteStringEditLog[27082175, 27082606], 
ByteStringEditLog[27082175, 27082606] of size 60008 edits # 432 loaded in 0 
seconds
2022-03-15 17:28:02,649 INFO  namenode.FSImage 
(FSImage.java:saveNamespace(1121)) - Save namespace ...
2022-03-15 17:28:02,650 INFO  namenode.FSImageFormatProtobuf 
(FSImageFormatProtobuf.java:save(718)) - Saving image file 
/data/hadoop/hdfs/namenode/current/fsimage.ckpt_0000000000027082606 using no 
compression
2022-03-15 17:28:03,180 INFO  namenode.FSImageFormatProtobuf 
(FSImageFormatProtobuf.java:save(722)) - Image file 
/data/hadoop/hdfs/namenode/current/fsimage.ckpt_0000000000027082606 of size 
17885002 bytes saved in 0 seconds .
2022-03-15 17:28:03,183 INFO  namenode.NNStorageRetentionManager 
(NNStorageRetentionManager.java:getImageTxIdToRetain(211)) - Going to retain 2 
images with txid >= 25892513
2022-03-15 17:28:03,183 INFO  namenode.NNStorageRetentionManager 
(NNStorageRetentionManager.java:purgeImage(233)) - Purging old image 
FSImageFile(file=/data/hadoop/hdfs/namenode/current/fsimage_0000000000024794305,
 cpktTxId=0000000000024794305)
2022-03-15 17:28:03,188 INFO  namenode.NNStorageRetentionManager 
(NNStorageRetentionManager.java:purgeOldStorage(169)) - purgeLogsFrom: 24892514
2022-03-15 17:28:03,282 INFO  namenode.TransferFsImage 
(TransferFsImage.java:copyFileToStream(396)) - Sending fileName: 
/data/hadoop/hdfs/namenode/current/fsimage_0000000000027082606, fileSize: 
17885002. Sent total: 17885002 bytes. Size of last segment intended to send: -1 
bytes.
2022-03-15 17:28:03,536 INFO  namenode.TransferFsImage 
(TransferFsImage.java:uploadImageFromStorage(240)) - Uploaded image with txid 
27082606 to namenode at http://sg-test-ambari-nn1.bigdata.bigo.inner:50070 in 
0.343 seconds
2022-03-15 17:28:03,640 INFO  namenode.TransferFsImage 
(TransferFsImage.java:copyFileToStream(396)) - Sending fileName: 
/data/hadoop/hdfs/namenode/current/fsimage_0000000000027082606, fileSize: 
17885002. Sent total: 17885002 bytes. Size of last segment intended to send: -1 
bytes.
2022-03-15 17:28:03,684 INFO  namenode.TransferFsImage 
(TransferFsImage.java:uploadImageFromStorage(240)) - Uploaded image with txid 
27082606 to namenode at http://sg-test-ambari-dn1.bigdata.bigo.inner:50070 in 
0.148 seconds
2022-03-15 17:28:03,748 INFO  namenode.TransferFsImage 
(TransferFsImage.java:copyFileToStream(396)) - Sending fileName: 
/data/hadoop/hdfs/namenode/current/fsimage_0000000000027082606, fileSize: 
17885002. Sent total: 17885002 bytes. Size of last segment intended to send: -1 
bytes.
2022-03-15 17:28:03,798 INFO  namenode.TransferFsImage 
(TransferFsImage.java:uploadImageFromStorage(240)) - Uploaded image with txid 
27082606 to namenode at http://sg-test-ambari-dn2.bigdata.bigo.inner:50070 in 
0.113 seconds
2022-03-15 17:28:03,798 INFO  ha.StandbyCheckpointer 
(StandbyCheckpointer.java:doWork(482)) - Checkpoint finished successfully.
 {code}
3. ANN. Purge edit logs.

 

{color:#FF0000}25892513 + 1 - 1000000 = 24892514{color}
{color:#FF0000}dfs.namenode.num.extra.edits.retained=1000000{color}
{code:java}
2022-03-15 17:28:03,515 INFO  namenode.NNStorageRetentionManager 
(NNStorageRetentionManager.java:getImageTxIdToRetain(211)) - Going to retain 2 
images with txid >= 25892513 {code}
{code:java}
2022-03-15 17:28:03,523 INFO  namenode.NNStorageRetentionManager 
(NNStorageRetentionManager.java:purgeOldStorage(169)) - purgeLogsFrom: 24892514 
2022-03-15 17:28:03,523 INFO  client.QuorumJournalManager 
(QuorumJournalManager.java:purgeLogsOlderThan(458)) - Purging remote journals 
older than txid 24892514 2022-03-15 17:28:03,524 INFO  
namenode.FileJournalManager (FileJournalManager.java:purgeLogsOlderThan(184)) - 
Purging logs older than 24892514
{code}
4. ANN. Purge {color:#FF0000}edits_inprogress_0000000000024207987{color}.
{code:java}
2022-03-15 17:28:03,525 INFO  namenode.NNStorageRetentionManager 
(NNStorageRetentionManager.java:purgeLog(227)) - Purging old edit log 
EditLogFile(file=/data/hadoop/hdfs/namenode/current/edits_inprogress_0000000000024207987,first=0000000000024207987,last=-000000000000012345,inProgress=true,hasCorruptHeader=false)
 {code}
5. JN. Purge edit logs include {color:#FF0000}24207987{color}.

{color:#FF0000}No log file to finalize at transaction ID 24207987{color}
{code:java}
2022-03-15 17:28:03,530 INFO  common.Storage 
(JNStorage.java:purgeMatching(196)) - Purging no-longer needed file 24207986
2022-03-15 17:28:03,564 INFO  common.Storage 
(JNStorage.java:purgeMatching(196)) - Purging no-longer needed file 24207987
2022-03-15 17:28:03,571 INFO  common.Storage 
(JNStorage.java:purgeMatching(196)) - Purging no-longer needed file 24207981
2022-03-15 17:28:03,572 INFO  common.Storage 
(JNStorage.java:purgeMatching(196)) - Purging no-longer needed file 24207984
2022-03-15 17:28:03,576 INFO  common.Storage 
(JNStorage.java:purgeMatching(196)) - Purging no-longer needed file 24207982
org.apache.hadoop.hdfs.qjournal.protocol.JournalOutOfSyncException: No log file 
to finalize at transaction ID 24207987 ; journal id: ambari-test
 {code}
6. ANN. Process exits because of the fatal exception from JNs.
{code:java}
2022-03-15 17:28:52,709 INFO  namenode.FSNamesystem 
(FSNamesystem.java:run(4232)) - NameNode rolling its own edit log because 
number of edits in open segment exceeds threshold of 100000
2022-03-15 17:28:52,709 INFO  namenode.FSEditLog 
(FSEditLog.java:rollEditLog(1346)) - Rolling edit logs
2022-03-15 17:28:52,709 INFO  namenode.FSEditLog 
(FSEditLog.java:endCurrentLogSegment(1442)) - Ending log segment 24207987, 
27990691
2022-03-15 17:28:52,748 INFO  namenode.FSEditLog 
(FSEditLog.java:printStatistics(806)) - Number of transactions: 3782706 Total 
time for transactions(ms): 25003 Number of transactions batched in Syncs:
3774095 Number of syncs: 8611 SyncTimes(ms): 11347 151932
2022-03-15 17:28:52,867 FATAL namenode.FSEditLog 
(JournalSet.java:mapJournalsAndReportErrors(393)) - Error: finalize log segment 
24207987, 27990692 failed for required journal (JournalAndStream(mgr=QJM
 to [10.152.106.55:8485, 10.152.115.74:8485, 10.152.124.150:8485, 
10.152.124.157:8485, 10.152.106.57:8485], stream=null))
org.apache.hadoop.hdfs.qjournal.client.QuorumException: Got too many exceptions 
to achieve quorum size 3/5. 5 exceptions thrown:
10.152.124.157:8485: No log file to finalize at transaction ID 24207987 ; 
journal id: ambari-test
        at 
org.apache.hadoop.hdfs.qjournal.server.Journal.finalizeLogSegment(Journal.java:656)
        at 
org.apache.hadoop.hdfs.qjournal.server.JournalNodeRpcServer.finalizeLogSegment(JournalNodeRpcServer.java:210)
        at 
org.apache.hadoop.hdfs.qjournal.protocolPB.QJournalProtocolServerSideTranslatorPB.finalizeLogSegment(QJournalProtocolServerSideTranslatorPB.java:205)
        at 
org.apache.hadoop.hdfs.qjournal.protocol.QJournalProtocolProtos$QJournalProtocolService$2.callBlockingMethod(QJournalProtocolProtos.java:28890)
        at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:550)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1094)
        at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1066)
        at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1000)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2989) {code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

---------------------------------------------------------------------
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org

Reply via email to