Tao Li created HDFS-17599:
-----------------------------

             Summary: Fix the mismatch between locations and indices for mover
                 Key: HDFS-17599
                 URL: https://issues.apache.org/jira/browse/HDFS-17599
             Project: Hadoop HDFS
          Issue Type: Bug
    Affects Versions: 3.4.0, 3.3.0
            Reporter: Tao Li
            Assignee: Tao Li
         Attachments: image-2024-08-03-17-59-08-059.png, image-2024-08-03-18-00-01-950.png
We set the EC policy to (6+3) and also had some nodes in the ENTERING_MAINTENANCE state. When we moved the data of some directories from SSD to HDD, some block moves failed because the disk was full, as shown in the figure below (blk_-9223372033441574269). We tried the move again and got the error "{color:#FF0000}Replica does not exist{color}". Looking at the fsck output, we found that the wrong block id (blk_-9223372033441574270) was used when moving the block.

{*}Mover Logs{*}:

!image-2024-08-03-17-59-08-059.png|width=741,height=85!

{*}FSCK Info{*}:

!image-2024-08-03-18-00-01-950.png|width=738,height=120!

{*}Root Cause{*}:

Similar to HDFS-16333, when the mover is initialized, only `LIVE` nodes are processed. As a result, the datanode in the `ENTERING_MAINTENANCE` state is filtered out of the locations, but the indices are not adapted, so the lengths of locations and indices no longer match. The EC block group then calculates the wrong block id when resolving an internal block (see `DBlockStriped#getInternalBlock`). We added debug logs, and a few key messages are shown below. {color:#FF0000}The result is an incorrect mapping: xx.xx.7.31 -> -9223372033441574270{color}.

{code:java}
DBlock getInternalBlock(StorageGroup storage) {
  // storage == xx.xx.7.31
  // idxInLocs == 1: locations is [xx.xx.85.29:DISK, xx.xx.7.31:DISK, xx.xx.207.22:DISK,
  // xx.xx.8.25:DISK, xx.xx.79.30:DISK, xx.xx.87.21:DISK, xx.xx.8.38:DISK];
  // xx.xx.179.31, which is in the ENTERING_MAINTENANCE state, has already been filtered out
  int idxInLocs = locations.indexOf(storage);
  if (idxInLocs == -1) {
    return null;
  }
  // idxInGroup == 2 (indices is [1,2,3,4,5,6,7,8])
  byte idxInGroup = indices[idxInLocs];
  // blkId: -9223372033441574272 + 2 = -9223372033441574270
  long blkId = getBlock().getBlockId() + idxInGroup;
  long numBytes = getInternalBlockLength(getNumBytes(), cellSize, dataBlockNum, idxInGroup);
  Block blk = new Block(getBlock());
  blk.setBlockId(blkId);
  blk.setNumBytes(numBytes);
  DBlock dblk = new DBlock(blk);
  dblk.addLocation(storage);
  return dblk;
}
{code}

{*}Solution{*}:

When initializing DBlockStriped, if any location is filtered out, we also need to remove the corresponding element of indices so that locations and indices stay aligned, as shown in the sketch below.
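Below is a minimal, self-contained sketch of that adaptation (not the actual patch): `StorageGroup`, the `live` flag, and `adaptIndices` are simplified stand-ins for the real Mover/Dispatcher types, and the positions of the nodes in the original locations array are assumed for illustration.

{code:java}
// Minimal sketch only: keep locations and indices aligned by dropping the index entry
// of every location that the mover filters out. StorageGroup, the "live" flag and
// adaptIndices are simplified stand-ins, not the real HDFS Mover/Dispatcher types.
import java.util.ArrayList;
import java.util.List;

public class StripedIndicesAdaptation {

  /** Simplified stand-in for a storage location of an internal block. */
  static class StorageGroup {
    final String datanode;
    final boolean live;  // false for e.g. ENTERING_MAINTENANCE nodes
    StorageGroup(String datanode, boolean live) {
      this.datanode = datanode;
      this.live = live;
    }
  }

  /**
   * Copies the LIVE locations into filteredLocations and returns the matching subset
   * of indices, so that result[i] still describes filteredLocations.get(i).
   */
  static byte[] adaptIndices(List<StorageGroup> locations, byte[] indices,
                             List<StorageGroup> filteredLocations) {
    List<Byte> kept = new ArrayList<>();
    for (int i = 0; i < locations.size(); i++) {
      StorageGroup sg = locations.get(i);
      if (sg.live) {  // only LIVE nodes are kept by the mover
        filteredLocations.add(sg);
        kept.add(indices[i]);
      }
    }
    byte[] result = new byte[kept.size()];
    for (int i = 0; i < result.length; i++) {
      result[i] = kept.get(i);
    }
    return result;
  }

  public static void main(String[] args) {
    // Illustrative ordering (assumed): xx.xx.179.31 sits right before xx.xx.7.31
    // in the original locations and is filtered out because it is ENTERING_MAINTENANCE.
    List<StorageGroup> locations = new ArrayList<>();
    locations.add(new StorageGroup("xx.xx.85.29", true));
    locations.add(new StorageGroup("xx.xx.179.31", false));  // filtered out
    locations.add(new StorageGroup("xx.xx.7.31", true));
    byte[] indices = {1, 2, 3};

    List<StorageGroup> filtered = new ArrayList<>();
    byte[] adapted = adaptIndices(locations, indices, filtered);

    long blockGroupId = -9223372033441574272L;
    int idxInLocs = 1;  // position of xx.xx.7.31 in the filtered locations
    // Without the adaptation: indices[1] == 2 -> -9223372033441574270 (does not exist).
    // With the adaptation:    adapted[1] == 3 -> -9223372033441574269 (the real replica).
    System.out.println("without fix: " + (blockGroupId + indices[idxInLocs]));
    System.out.println("with fix:    " + (blockGroupId + adapted[idxInLocs]));
  }
}
{code}

With the indices adapted, xx.xx.7.31 maps back to blk_-9223372033441574269 (the internal block that originally failed to move because of the full disk) instead of the non-existent blk_-9223372033441574270.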