Tao Li created HDFS-17599:
-----------------------------

             Summary: Fix the mismatch between locations and indices for mover
                 Key: HDFS-17599
                 URL: https://issues.apache.org/jira/browse/HDFS-17599
             Project: Hadoop HDFS
          Issue Type: Bug
    Affects Versions: 3.4.0, 3.3.0
            Reporter: Tao Li
            Assignee: Tao Li
         Attachments: image-2024-08-03-17-59-08-059.png, image-2024-08-03-18-00-01-950.png
We set the EC policy to (6+3) and also had some nodes in the ENTERING_MAINTENANCE state. When we moved the data of some directories from SSD to HDD, some block moves failed because the disk was full, as shown in the figure below (blk_-9223372033441574269). We tried the move again and got the error "{color:#FF0000}Replica does not exist{color}". Looking at the fsck output, we found that the wrong block id (blk_-9223372033441574270) was used when moving the block.

{*}Mover Logs{*}:

!image-2024-08-03-17-59-08-059.png|width=741,height=85!

{*}FSCK Info{*}:

!image-2024-08-03-18-00-01-950.png|width=738,height=120!

{*}Root Cause{*}:

Similar to HDFS-16333, when the mover is initialized, only `LIVE` nodes are processed. As a result, the datanode in the `ENTERING_MAINTENANCE` state is filtered out of the locations, but the indices are not adapted, so the lengths of locations and indices no longer match. The EC block group then calculates the wrong block id when resolving an internal block (see `DBlockStriped#getInternalBlock`). We added debug logs, and a few key messages are shown below. {color:#FF0000}The result is an incorrect mapping: xx.xx.7.31 -> -9223372033441574270{color}.

{code:java}
DBlock getInternalBlock(StorageGroup storage) {
  // storage == xx.xx.7.31
  // idxInLocs == 1: locations is [xx.xx.85.29:DISK, xx.xx.7.31:DISK, xx.xx.207.22:DISK,
  // xx.xx.8.25:DISK, xx.xx.79.30:DISK, xx.xx.87.21:DISK, xx.xx.8.38:DISK];
  // xx.xx.179.31, which is in the ENTERING_MAINTENANCE state, has already been filtered out
  int idxInLocs = locations.indexOf(storage);
  if (idxInLocs == -1) {
    return null;
  }
  // idxInGroup == 2 (indices is [1,2,3,4,5,6,7,8])
  byte idxInGroup = indices[idxInLocs];
  // blkId: -9223372033441574272 + 2 = -9223372033441574270
  long blkId = getBlock().getBlockId() + idxInGroup;
  long numBytes = getInternalBlockLength(getNumBytes(), cellSize, dataBlockNum, idxInGroup);
  Block blk = new Block(getBlock());
  blk.setBlockId(blkId);
  blk.setNumBytes(numBytes);
  DBlock dblk = new DBlock(blk);
  dblk.addLocation(storage);
  return dblk;
}
{code}

{*}Solution{*}:

When initializing DBlockStriped, if any location is filtered out, we also need to remove the corresponding element of indices so that locations and indices stay aligned, as shown in the sketch below.
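Below is a minimal, self-contained sketch of that adaptation (not the actual patch): `StorageGroup`, the `live` flag, and `adaptIndices` are simplified stand-ins for the real Mover/Dispatcher types, and the positions of the nodes in the original locations array are assumed for illustration.

{code:java}
// Minimal sketch only: keep locations and indices aligned by dropping the index entry
// of every location that the mover filters out. StorageGroup, the "live" flag and
// adaptIndices are simplified stand-ins, not the real HDFS Mover/Dispatcher types.
import java.util.ArrayList;
import java.util.List;

public class StripedIndicesAdaptation {

  /** Simplified stand-in for a storage location of an internal block. */
  static class StorageGroup {
    final String datanode;
    final boolean live;  // false for e.g. ENTERING_MAINTENANCE nodes
    StorageGroup(String datanode, boolean live) {
      this.datanode = datanode;
      this.live = live;
    }
  }

  /**
   * Copies the LIVE locations into filteredLocations and returns the matching subset
   * of indices, so that result[i] still describes filteredLocations.get(i).
   */
  static byte[] adaptIndices(List<StorageGroup> locations, byte[] indices,
                             List<StorageGroup> filteredLocations) {
    List<Byte> kept = new ArrayList<>();
    for (int i = 0; i < locations.size(); i++) {
      StorageGroup sg = locations.get(i);
      if (sg.live) {  // only LIVE nodes are kept by the mover
        filteredLocations.add(sg);
        kept.add(indices[i]);
      }
    }
    byte[] result = new byte[kept.size()];
    for (int i = 0; i < result.length; i++) {
      result[i] = kept.get(i);
    }
    return result;
  }

  public static void main(String[] args) {
    // Illustrative ordering (assumed): xx.xx.179.31 sits right before xx.xx.7.31
    // in the original locations and is filtered out because it is ENTERING_MAINTENANCE.
    List<StorageGroup> locations = new ArrayList<>();
    locations.add(new StorageGroup("xx.xx.85.29", true));
    locations.add(new StorageGroup("xx.xx.179.31", false));  // filtered out
    locations.add(new StorageGroup("xx.xx.7.31", true));
    byte[] indices = {1, 2, 3};

    List<StorageGroup> filtered = new ArrayList<>();
    byte[] adapted = adaptIndices(locations, indices, filtered);

    long blockGroupId = -9223372033441574272L;
    int idxInLocs = 1;  // position of xx.xx.7.31 in the filtered locations
    // Without the adaptation: indices[1] == 2 -> -9223372033441574270 (does not exist).
    // With the adaptation:    adapted[1] == 3 -> -9223372033441574269 (the real replica).
    System.out.println("without fix: " + (blockGroupId + indices[idxInLocs]));
    System.out.println("with fix:    " + (blockGroupId + adapted[idxInLocs]));
  }
}
{code}

With the indices adapted, xx.xx.7.31 maps back to blk_-9223372033441574269 (the internal block that originally failed to move because of the full disk) instead of the non-existent blk_-9223372033441574270.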