Junegunn Choi created HDFS-17690:
------------------------------------

             Summary: Avoid redundant EC reconstruction tasks after pending 
reconstruction timeouts
                 Key: HDFS-17690
                 URL: https://issues.apache.org/jira/browse/HDFS-17690
             Project: Hadoop HDFS
          Issue Type: Improvement
          Components: ec
    Affects Versions: 3.4.1
            Reporter: Junegunn Choi
         Attachments: image-2024-12-18-13-50-11-978.png, 
image-2024-12-18-13-51-33-552.png

h2. Problem

We are running HDFS clusters with RS-6-3-1024k policy enabled.

When datanodes go down, HDFS reconstructs the under-replicated EC blocks as 
expected and the number of low redundancy blocks decreases. However, after some 
time, we observe a lot of wasted effort reconstructing EC blocks that were 
already fully reconstructed long before.

!image-2024-12-18-13-50-11-978.png!
 * We shut down two datanodes in an HDFS cluster with 21 datanodes.
 * Each datanode holds around 2 TB of data on SSD devices.
 * It took 10 hours for the low redundancy metric to reach 0.
 * 17K+ failed reconstruction tasks wasted disk and network resources during 
the period.
 ** (There were no other activities on the cluster)

These are the log messages for {{blk_-9223372036851532480_352520}}; note the 
repeated {{"Failed to reconstruct"}} messages on different datanodes.
{noformat}
@timestamp,server,loglevel,role,data
Dec 18, 2024 @ 00:56:07.000,s19,WARN,datanode,Failed to reconstruct striped 
block: BP-1483963022-k1-1732859582319:blk_-9223372036851532480_352520
Dec 18, 2024 @ 01:03:41.000,s04,WARN,datanode,Failed to reconstruct striped 
block: BP-1483963022-k1-1732859582319:blk_-9223372036851532480_352520
Dec 18, 2024 @ 04:56:16.000,s20,WARN,datanode,Failed to reconstruct striped 
block: BP-1483963022-k1-1732859582319:blk_-9223372036851532480_352520
Dec 18, 2024 @ 05:18:37.000,s20,WARN,datanode,Failed to reconstruct striped 
block: BP-1483963022-k1-1732859582319:blk_-9223372036851532480_352520
Dec 18, 2024 @ 07:18:31.000,s20,WARN,datanode,Failed to reconstruct striped 
block: BP-1483963022-k1-1732859582319:blk_-9223372036851532480_352520
Dec 18, 2024 @ 07:38:34.000,s20,WARN,datanode,Failed to reconstruct striped 
block: BP-1483963022-k1-1732859582319:blk_-9223372036851532480_352520
Dec 18, 2024 @ 08:18:28.000,s19,WARN,datanode,Failed to reconstruct striped 
block: BP-1483963022-k1-1732859582319:blk_-9223372036851532480_352520
Dec 18, 2024 @ 08:18:40.000,s19,WARN,datanode,Failed to reconstruct striped 
block: BP-1483963022-k1-1732859582319:blk_-9223372036851532480_352520
Dec 18, 2024 @ 08:19:07.000,s19,WARN,datanode,Failed to reconstruct striped 
block: BP-1483963022-k1-1732859582319:blk_-9223372036851532480_352520
Dec 18, 2024 @ 08:23:08.000,s18,WARN,datanode,Failed to reconstruct striped 
block: BP-1483963022-k1-1732859582319:blk_-9223372036851532480_352520
Dec 18, 2024 @ 08:23:38.000,s18,WARN,datanode,Failed to reconstruct striped 
block: BP-1483963022-k1-1732859582319:blk_-9223372036851532480_352520
{noformat}
The repeated failures were caused by {{ReplicaAlreadyExistsException}}, which 
means the tasks being performed were redundant.
{noformat}
24/12/18 08:23:59 INFO datanode.DataNode: DatanodeCommand action: 
DNA_ERASURE_CODING_RECOVERY
24/12/18 08:23:59 INFO datanode.DataNode: Receiving 
BP-1483963022-k1-1732859582319:blk_-9223372036851532476_352520 src: /s18:35816 
dest: /s18:1004
24/12/18 08:23:59 INFO datanode.DataNode: opWriteBlock 
BP-1483963022-k1-1732859582319:blk_-9223372036851532476_352520 received 
exception org.apache.hadoop.hdfs.server.datanode.ReplicaAlreadyExistsException: 
Block BP-1483963022-k1-1732859582319:blk_-9223372036851532476_352520 already 
exists in state FINALIZED and thus cannot be created.
24/12/18 08:23:59 INFO datanode.DataNode: s18:1004:DataXceiver error processing 
WRITE_BLOCK operation  src: /s18:35816 dst: /s18:1004; 
org.apache.hadoop.hdfs.server.datanode.ReplicaAlreadyExistsException: Block 
BP-1483963022-k1-1732859582319:blk_-9223372036851532476_352520 already exists 
in state FINALIZED and thus cannot be created.
24/12/18 08:23:59 WARN datanode.DataNode: Broken pipe
24/12/18 08:23:59 WARN datanode.DataNode: Failed to reconstruct striped block: 
BP-1483963022-k1-1732859582319:blk_-9223372036851532480_352520
java.io.IOException: Transfer failed for all targets.
        at 
org.apache.hadoop.hdfs.server.datanode.erasurecode.StripedBlockReconstructor.reconstruct(StripedBlockReconstructor.java:118)
        at 
org.apache.hadoop.hdfs.server.datanode.erasurecode.StripedBlockReconstructor.run(StripedBlockReconstructor.java:63)
        at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
{noformat}
{{hdfs fsck -blockId blk_-9223372036851532480}} also shows that this block 
doesn't need reconstruction.
{noformat}
No. of Expected Replica: 9
No. of live Replica: 9
{noformat}
h2. Workaround

Because this happens only after several {{PendingReconstructionMonitor timed 
out}} messages, I tried increasing 
{{dfs.namenode.reconstruction.pending.timeout-sec}} so that no timeouts occur, 
and the problem disappeared.
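For reference, the workaround is a single property in {{hdfs-site.xml}}. The value below is illustrative only; pick something comfortably larger than your worst-case reconstruction time:
{noformat}
<!-- Illustrative value (24h): long enough that pending EC reconstruction
     tasks never time out and get redundantly rescheduled -->
<property>
  <name>dfs.namenode.reconstruction.pending.timeout-sec</name>
  <value>86400</value>
</property>
{noformat}
Note this only masks the symptom: a task that genuinely fails will now take much longer to be retried.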
h2. Cause

When an EC reconstruction task for a block is scheduled, a 
{{BlockECReconstructionInfo}} for the block is added to the 
{{erasurecodeBlocks}} queue of the {{DatanodeDescriptor}} for the target node. 
However, it is not removed from the queue when the reconstruction task times 
out and is recreated and rescheduled, so we end up with multiple redundant 
{{BlockECReconstructionInfo}} objects on the queues of several datanodes.

This explains the large amount of wasted effort after multiple 
{{PendingReconstructionMonitor}} timeouts.
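The accumulation can be sketched with a minimal, self-contained model (the class, field, and method names below are simplified stand-ins for {{DatanodeDescriptor#erasurecodeBlocks}} and {{BlockECReconstructionInfo}}, not the actual Hadoop classes):

```java
import java.util.ArrayDeque;
import java.util.Queue;

// Minimal model of the duplication: the per-datanode queue only grows on
// timeouts, because each timeout-and-reschedule enqueues a new task while
// the stale entry for the same block is left in place.
public class PendingTimeoutModel {
    // stand-in for BlockECReconstructionInfo
    record ReconstructionInfo(long blockId) {}

    // Returns the number of queued tasks for one block after the initial
    // scheduling plus `timeouts` timeout-and-reschedule rounds.
    static int simulateTimeouts(long blockId, int timeouts) {
        // stand-in for DatanodeDescriptor#erasurecodeBlocks
        Queue<ReconstructionInfo> erasurecodeBlocks = new ArrayDeque<>();
        erasurecodeBlocks.add(new ReconstructionInfo(blockId)); // initial task
        for (int i = 0; i < timeouts; i++) {
            // timeout: a fresh task is enqueued, the old entry is not removed
            erasurecodeBlocks.add(new ReconstructionInfo(blockId));
        }
        return erasurecodeBlocks.size();
    }

    public static void main(String[] args) {
        // two timeouts leave three redundant tasks for the same block
        System.out.println(simulateTimeouts(-9223372036851532480L, 2));
    }
}
```

Each surviving entry is eventually dispatched, which is why the same block shows repeated reconstruction attempts across several datanodes.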

You're more likely to experience this problem on a relatively small cluster 
with a small {{dfs.namenode.replication.max-streams}} value, where block 
reconstruction for failed datanodes takes a considerable amount of time and 
triggers many pending reconstruction timeouts. This was exactly our case: our 
cluster has only 21 datanodes, and {{dfs.namenode.replication.max-streams}} is 
set to 1 because we run HBase on the same cluster and don't want eager 
reconstruction to affect HBase's performance.
h2. Suggested fix

Avoid processing redundant {{BlockECReconstructionInfo}} entries from the 
queue: re-check whether the task is still needed just before dispatching it to 
the datanode.
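A rough sketch of the guard (all names here, including {{isReconstructionStillNeeded}}, are illustrative placeholders rather than the actual {{BlockManager}} API):

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;
import java.util.function.LongPredicate;

// Sketch of the suggested fix: before handing a queued task to a datanode,
// re-check whether the block still needs reconstruction and silently drop
// the stale entries left behind by pending-reconstruction timeouts.
public class DispatchGuard {
    // stand-in for BlockECReconstructionInfo
    record ReconstructionInfo(long blockId) {}

    // Drains the queue and returns only the tasks that are still needed.
    // `isReconstructionStillNeeded` is a hypothetical check that would, in
    // the namenode, compare live replicas against the expected total.
    static List<ReconstructionInfo> takeTasks(
            Queue<ReconstructionInfo> erasurecodeBlocks,
            LongPredicate isReconstructionStillNeeded) {
        List<ReconstructionInfo> toDispatch = new ArrayList<>();
        ReconstructionInfo info;
        while ((info = erasurecodeBlocks.poll()) != null) {
            if (isReconstructionStillNeeded.test(info.blockId())) {
                toDispatch.add(info);
            }
            // else: skip — the block already has enough live replicas
        }
        return toDispatch;
    }
}
```

The check is cheap relative to a reconstruction task, which reads from six source datanodes and writes the recovered blocks over the network.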
h2. Result

!image-2024-12-18-13-51-33-552.png!
 * Took down 2 datanodes
 * Only took 3.5 hours to complete the reconstruction
 * 10K+ {{"doesn't need reconstruction"}} messages in the namenode log, 
showing that dispatching redundant tasks was successfully avoided.
 * No {{org.apache.hadoop.hdfs.server.datanode.ReplicaAlreadyExistsException}} 
errors

h2. Alternative approach considered

We could instead remove the {{BlockECReconstructionInfo}} from the queue when 
the task times out, but this would require more bookkeeping and be more 
error-prone.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)