Junegunn Choi created HDFS-17690:
------------------------------------

             Summary: Avoid redundant EC reconstruction tasks after pending reconstruction timeouts
                 Key: HDFS-17690
                 URL: https://issues.apache.org/jira/browse/HDFS-17690
             Project: Hadoop HDFS
          Issue Type: Improvement
          Components: ec
    Affects Versions: 3.4.1
            Reporter: Junegunn Choi
         Attachments: image-2024-12-18-13-50-11-978.png, image-2024-12-18-13-51-33-552.png
h2. Problem

We are running HDFS clusters with the RS-6-3-1024k policy enabled. When datanodes go down, HDFS reconstructs the under-replicated EC blocks as expected and the number of low-redundancy blocks decreases. However, after some period of time, we observe a lot of wasted effort reconstructing EC blocks that were already fully reconstructed long before.

!image-2024-12-18-13-50-11-978.png!

* We shut down two datanodes in an HDFS cluster with 21 datanodes.
* Each datanode holds around 2 TB of data on SSD devices.
* It took 10 hours for the low-redundancy metric to reach 0.
* There were lots of (17K+) failed reconstruction tasks, wasting disk and network resources during that period.
** (There were no other activities on the cluster)

These are the log messages for {{blk_-9223372036851532480_352520}}; note the repeated {{"Failed to reconstruct"}} messages from different datanodes.

{noformat}
@timestamp,server,loglevel,role,data
Dec 18, 2024 @ 00:56:07.000,s19,WARN,datanode,Failed to reconstruct striped block: BP-1483963022-k1-1732859582319:blk_-9223372036851532480_352520
Dec 18, 2024 @ 01:03:41.000,s04,WARN,datanode,Failed to reconstruct striped block: BP-1483963022-k1-1732859582319:blk_-9223372036851532480_352520
Dec 18, 2024 @ 04:56:16.000,s20,WARN,datanode,Failed to reconstruct striped block: BP-1483963022-k1-1732859582319:blk_-9223372036851532480_352520
Dec 18, 2024 @ 05:18:37.000,s20,WARN,datanode,Failed to reconstruct striped block: BP-1483963022-k1-1732859582319:blk_-9223372036851532480_352520
Dec 18, 2024 @ 07:18:31.000,s20,WARN,datanode,Failed to reconstruct striped block: BP-1483963022-k1-1732859582319:blk_-9223372036851532480_352520
Dec 18, 2024 @ 07:38:34.000,s20,WARN,datanode,Failed to reconstruct striped block: BP-1483963022-k1-1732859582319:blk_-9223372036851532480_352520
Dec 18, 2024 @ 08:18:28.000,s19,WARN,datanode,Failed to reconstruct striped block: BP-1483963022-k1-1732859582319:blk_-9223372036851532480_352520
Dec 18, 2024 @ 08:18:40.000,s19,WARN,datanode,Failed to reconstruct striped block: BP-1483963022-k1-1732859582319:blk_-9223372036851532480_352520
Dec 18, 2024 @ 08:19:07.000,s19,WARN,datanode,Failed to reconstruct striped block: BP-1483963022-k1-1732859582319:blk_-9223372036851532480_352520
Dec 18, 2024 @ 08:23:08.000,s18,WARN,datanode,Failed to reconstruct striped block: BP-1483963022-k1-1732859582319:blk_-9223372036851532480_352520
Dec 18, 2024 @ 08:23:38.000,s18,WARN,datanode,Failed to reconstruct striped block: BP-1483963022-k1-1732859582319:blk_-9223372036851532480_352520
{noformat}

The repeated failures were caused by {{ReplicaAlreadyExistsException}}, which means we were performing redundant tasks.

{noformat}
24/12/18 08:23:59 INFO datanode.DataNode: DatanodeCommand action: DNA_ERASURE_CODING_RECOVERY
24/12/18 08:23:59 INFO datanode.DataNode: Receiving BP-1483963022-k1-1732859582319:blk_-9223372036851532476_352520 src: /s18:35816 dest: /s18:1004
24/12/18 08:23:59 INFO datanode.DataNode: opWriteBlock BP-1483963022-k1-1732859582319:blk_-9223372036851532476_352520 received exception org.apache.hadoop.hdfs.server.datanode.ReplicaAlreadyExistsException: Block BP-1483963022-k1-1732859582319:blk_-9223372036851532476_352520 already exists in state FINALIZED and thus cannot be created.
24/12/18 08:23:59 INFO datanode.DataNode: s18:1004:DataXceiver error processing WRITE_BLOCK operation src: /s18:35816 dst: /s18:1004; org.apache.hadoop.hdfs.server.datanode.ReplicaAlreadyExistsException: Block BP-1483963022-k1-1732859582319:blk_-9223372036851532476_352520 already exists in state FINALIZED and thus cannot be created.
24/12/18 08:23:59 WARN datanode.DataNode: Broken pipe
24/12/18 08:23:59 WARN datanode.DataNode: Failed to reconstruct striped block: BP-1483963022-k1-1732859582319:blk_-9223372036851532480_352520
java.io.IOException: Transfer failed for all targets.
	at org.apache.hadoop.hdfs.server.datanode.erasurecode.StripedBlockReconstructor.reconstruct(StripedBlockReconstructor.java:118)
	at org.apache.hadoop.hdfs.server.datanode.erasurecode.StripedBlockReconstructor.run(StripedBlockReconstructor.java:63)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
{noformat}

{{hdfs fsck -blockId blk_-9223372036851532480}} also shows that this block doesn't need reconstruction.

{noformat}
No. of Expected Replica: 9
No. of live Replica: 9
{noformat}

h2. Workaround

Because this happens only after several {{PendingReconstructionMonitor timed out}} messages, I tried increasing {{dfs.namenode.reconstruction.pending.timeout-sec}} so that no timeout occurs, and the problem disappeared.

h2. Cause

When an EC reconstruction task for a block is scheduled, a {{BlockECReconstructionInfo}} for the block is added to the {{erasurecodeBlocks}} queue of the {{DatanodeDescriptor}} for the target node. However, it is not removed from that queue when the reconstruction task times out and is recreated and rescheduled, so we end up with multiple redundant {{BlockECReconstructionInfo}} objects on the queues of several datanodes. This explains the large amount of wasted effort after multiple PendingReconstructionMonitor timeouts.

You are more likely to hit this problem on a relatively small cluster with a small {{dfs.namenode.replication.max-streams}} value, where block reconstruction for failed datanodes takes a considerable amount of time and triggers many pending reconstruction timeouts.
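The duplication mechanism can be illustrated with a minimal, self-contained sketch. The class and method names below are hypothetical stand-ins, not Hadoop's actual internals; it only models how rescheduling after a timeout, without removing the stale entry, leaves multiple queued tasks for the same block:

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;

// Hypothetical model of per-datanode reconstruction queues, analogous to
// DatanodeDescriptor#erasurecodeBlocks in the NameNode. Not Hadoop code.
public class PendingReconstructionModel {
    // One work queue of block IDs per datanode.
    static final Map<String, Queue<Long>> nodeQueues = new HashMap<>();

    // Scheduling simply appends to the target node's queue; nothing ever
    // removes an entry when the pending-reconstruction timer fires.
    static void schedule(String node, long blockId) {
        nodeQueues.computeIfAbsent(node, n -> new ArrayDeque<>()).add(blockId);
    }

    // Count queued tasks for a block across all datanodes.
    static long queuedTasks(long blockId) {
        return nodeQueues.values().stream()
                .flatMap(Queue::stream)
                .filter(b -> b == blockId)
                .count();
    }

    public static void main(String[] args) {
        long blk = -9223372036851532480L;
        schedule("s19", blk); // initial reconstruction task
        // PendingReconstructionMonitor times out and the block is
        // rescheduled on another node -- the s19 entry is never removed.
        schedule("s04", blk);
        schedule("s20", blk); // another timeout, another node
        System.out.println("queued tasks for block: " + queuedTasks(blk));
        // -> 3: three datanodes will each attempt the same reconstruction,
        // and all but the first fail with ReplicaAlreadyExistsException.
    }
}
```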
This was exactly the case for us: our cluster has only 21 datanodes, and {{dfs.namenode.replication.max-streams}} is set to 1 because we run HBase on the same cluster and don't want eager reconstruction to affect HBase's performance.

h2. Suggested fix

Avoid processing redundant {{BlockECReconstructionInfo}} entries from the queue: check again whether the task is really needed just before dispatching it to the datanode.

h2. Result

!image-2024-12-18-13-51-33-552.png!

* Took down 2 datanodes
* The reconstruction completed in only 3.5 hours
* 10K+ {{"doesn't need reconstruction"}} messages in the namenode log, showing that we successfully avoided dispatching redundant tasks
* No {{org.apache.hadoop.hdfs.server.datanode.ReplicaAlreadyExistsException}} errors

h2. Alternative approach considered

We could instead remove the {{BlockECReconstructionInfo}} from the queue when the task times out, but this would require more bookkeeping and is more error-prone.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org