Junegunn Choi created HDFS-17620:
------------------------------------

             Summary: Better block placement for small EC files
                 Key: HDFS-17620
                 URL: https://issues.apache.org/jira/browse/HDFS-17620
             Project: Hadoop HDFS
          Issue Type: Bug
          Components: erasure-coding, namenode
    Affects Versions: 3.3.6
            Reporter: Junegunn Choi


h2. Problem description

If an erasure-coded file is not large enough to fill the stripe width of the EC 
policy, the block distribution can be suboptimal.

For example, an RS-6-3-1024K EC file smaller than 1024K will have 1 data block 
and 3 parity blocks. All 9 (6 + 3) storage locations are chosen by the block 
placement policy, but only 4 of them are used: the first for the data block and 
the last 3 for the parity blocks. If the cluster has a very small number of 
racks (e.g. 3), the current scheme of finding a pipeline with the shortest path 
makes the last nodes likely to be on the same rack, resulting in a suboptimal 
rack distribution.
{noformat}
Locations: N1 N2 N3 N4 N5 N6 N7 N8 N9
    Racks: R1 R1 R1 R2 R2 R2 R3 R3 R3
   Blocks: D1                P1 P2 P3
{noformat}
We can see that blocks are stored in only 2 racks, not 3.
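For concreteness, here is a tiny self-contained snippet (not HDFS code; the rack 
assignment simply mirrors the diagram above) that reproduces the rack 
arithmetic:
{code:java}
import java.util.Set;
import java.util.TreeSet;

public class SmallEcRackCount {
  public static void main(String[] args) {
    // Racks of the 9 chosen locations, as in the diagram above.
    String[] racks = {"R1", "R1", "R1", "R2", "R2", "R2", "R3", "R3", "R3"};
    // A sub-1024K RS-6-3 file occupies internal block index 0 (the
    // single data block) and indices 6..8 (the three parity blocks).
    int[] usedIndices = {0, 6, 7, 8};
    Set<String> coveredRacks = new TreeSet<>();
    for (int i : usedIndices) {
      coveredRacks.add(racks[i]);
    }
    System.out.println(coveredRacks); // [R1, R3] -- only 2 of the 3 racks
  }
}
{code}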

Because the block does not span enough racks, an {{ErasureCodingWork}} is later 
created to replicate the block to a new rack. However, the current code tries 
to copy the block to the first node in the chosen locations, regardless of its 
rack, so the copy is not guaranteed to improve the placement, and we constantly 
see {{PendingReconstructionMonitor timed out}} messages in the log.
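A toy illustration of the dead end (self-contained, with made-up values; not 
the actual {{ErasureCodingWork}} logic):
{code:java}
import java.util.Set;

public class FirstNodeTarget {
  public static void main(String[] args) {
    // Racks already holding the block group in the example above.
    Set<String> heldRacks = Set.of("R1", "R3");
    // Suppose the "first node in the chosen locations" is on R1.
    String firstNodeRack = "R1";
    boolean addsNewRack = !heldRacks.contains(firstNodeRack);
    // Prints false: the copy lands on an already-used rack, the
    // placement stays unsatisfactory, and the pending reconstruction
    // eventually times out.
    System.out.println("adds a new rack: " + addsNewRack);
  }
}
{code}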
h2. Proposed solution

1. Reorder the chosen locations by rack so that the parity blocks are stored in 
as many racks as possible.
2. Make {{ErasureCodingWork}} try to find a target on a new rack (a sketch of 
both steps follows below).
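The sketch below illustrates both steps with hypothetical names and a simple 
round-robin interleaving; nothing here is taken from the actual patch:
{code:java}
import java.util.*;

public class ProposedPlacement {

  // Step 1: interleave the chosen locations across racks so that the
  // tail indices used by the parity blocks span as many racks as possible.
  static List<String> reorderByRack(List<String> nodes, Map<String, String> rackOf) {
    // Group nodes by rack, preserving the original order within a rack.
    Map<String, Deque<String>> byRack = new LinkedHashMap<>();
    for (String n : nodes) {
      byRack.computeIfAbsent(rackOf.get(n), r -> new ArrayDeque<>()).add(n);
    }
    // Take one node from each rack in turn until all are consumed.
    List<String> out = new ArrayList<>(nodes.size());
    while (out.size() < nodes.size()) {
      for (Deque<String> q : byRack.values()) {
        if (!q.isEmpty()) {
          out.add(q.poll());
        }
      }
    }
    return out;
  }

  // Step 2: when scheduling the extra replica, prefer a target whose
  // rack is not yet used by the block group.
  static String pickTargetOnNewRack(List<String> candidates,
                                    Set<String> heldRacks,
                                    Map<String, String> rackOf) {
    for (String n : candidates) {
      if (!heldRacks.contains(rackOf.get(n))) {
        return n; // first candidate that actually adds a rack
      }
    }
    return candidates.isEmpty() ? null : candidates.get(0); // fallback
  }

  public static void main(String[] args) {
    String[] nodes = {"N1", "N2", "N3", "N4", "N5", "N6", "N7", "N8", "N9"};
    String[] racks = {"R1", "R1", "R1", "R2", "R2", "R2", "R3", "R3", "R3"};
    Map<String, String> rackOf = new LinkedHashMap<>();
    for (int i = 0; i < nodes.length; i++) {
      rackOf.put(nodes[i], racks[i]);
    }

    // Step 1: prints [N1, N4, N7, N2, N5, N8, N3, N6, N9]; a sub-1024K
    // file then uses indices 0, 6, 7, 8 = N1 (R1), N3 (R1), N6 (R2),
    // N9 (R3) -- all 3 racks covered.
    System.out.println(reorderByRack(Arrays.asList(nodes), rackOf));

    // Step 2: with racks {R1, R3} already held, a candidate on R2 wins.
    System.out.println(pickTargetOnNewRack(
        Arrays.asList("N2", "N5"), Set.of("R1", "R3"), rackOf)); // N5
  }
}
{code}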
h2. Real-world test result

We first noticed the problem on our HBase cluster running Hadoop 3.3.6 on 18 
nodes across 3 racks. After setting the RS-6-3-1024K policy on the HBase data 
directory, we observed that:

1. FSCK reports "Unsatisfactory placement block groups" for small EC files.
{noformat}
  /hbase/***:  Replica placement policy is violated for ***. Block should be 
additionally replicated on 2 more rack(s). Total number of racks in the 
cluster: 3
  ...

  Erasure Coded Block Groups:
    ...
    Unsatisfactory placement block groups: 1475 (2.5252092 %)
{noformat}
2. The NameNode keeps logging "PendingReconstructionMonitor timed out" messages 
every recheck interval (5 minutes).
3. The {{FSNamesystem.UnderReplicatedBlocks}} metric bumps and clears every 
recheck interval.

After applying the patch, all of these problems are gone: "Unsatisfactory 
placement block groups" is now zero, and there are no more metric bumps or 
"timed out" logs.


