Jack Yang created HDFS-17608:
--------------------------------

             Summary: Datanodes Decommissioning hang forever if the node under decommissioning has disk media error
                 Key: HDFS-17608
                 URL: https://issues.apache.org/jira/browse/HDFS-17608
             Project: Hadoop HDFS
          Issue Type: Bug
          Components: datanode
    Affects Versions: 3.3.6
         Environment: Redhat 8.7, Hadoop 3.3.6
            Reporter: Jack Yang
         Attachments: image-2024-08-26-10-37-27-359.png

The blocks on the decommissioning datanode are all EC striped blocks. The decommissioning process hangs forever, repeatedly emitting these logs:

2024-08-26 10:31:14,748 WARN  datanode.DataNode (DataNode.java:run(2927)) - DatanodeRegistration(10.18.130.251:1019, datanodeUuid=a9e27f77-eb6e-46df-ad4c-b5daf2bf9508, infoPort=1022, infoSecurePort=0, ipcPort=8010, storageInfo=lv=-57;cid=CID-75a4da17-d28b-4820-b781-7c9f8dced67f;nsid=2079136093;c=1692354715862):Failed to transfer BP-184818459-10.18.130.160-1692354715862:blk_-9223372036501683307_35436379 to x.x.x.x:1019 got
java.io.IOException: Input/output error
        at java.io.FileInputStream.readBytes(Native Method)
        at java.io.FileInputStream.read(FileInputStream.java:255)
        at org.apache.hadoop.hdfs.server.datanode.FileIoProvider$WrappedFileInputStream.read(FileIoProvider.java:881)
        at java.io.FilterInputStream.read(FilterInputStream.java:133)
        at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
        at java.io.BufferedInputStream.read1(BufferedInputStream.java:286)
        at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
        at java.io.DataInputStream.read(DataInputStream.java:149)
        at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:215)
        at org.apache.hadoop.hdfs.server.datanode.fsdataset.ReplicaInputStreams.readChecksumFully(ReplicaInputStreams.java:90)
        at org.apache.hadoop.hdfs.server.datanode.BlockSender.readChecksum(BlockSender.java:691)
        at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendPacket(BlockSender.java:578)
        at org.apache.hadoop.hdfs.server.datanode.BlockSender.doSendBlock(BlockSender.java:816)
        at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendBlock(BlockSender.java:763)
        at org.apache.hadoop.hdfs.server.datanode.DataNode$DataTransfer.run(DataNode.java:2900)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:750)
2024-08-26 10:31:14,758 WARN  datanode.DataNode (BlockSender.java:readChecksum(693)) -  Could not read or failed to verify checksum for data at offset 10878976 for block BP-184818459-x.x.x.x-1692354715862:blk_-9223372036827280880_3990731
java.io.IOException: Input/output error
        at java.io.FileInputStream.readBytes(Native Method)
        at java.io.FileInputStream.read(FileInputStream.java:255)
        at org.apache.hadoop.hdfs.server.datanode.FileIoProvider$WrappedFileInputStream.read(FileIoProvider.java:881)
        at java.io.FilterInputStream.read(FilterInputStream.java:133)
        at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
        at java.io.BufferedInputStream.read1(BufferedInputStream.java:286)
        at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
        at java.io.DataInputStream.read(DataInputStream.java:149)
        at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:215)
        at org.apache.hadoop.hdfs.server.datanode.fsdataset.ReplicaInputStreams.readChecksumFully(ReplicaInputStreams.java:90)
        at org.apache.hadoop.hdfs.server.datanode.BlockSender.readChecksum(BlockSender.java:691)
        at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendPacket(BlockSender.java:578)
        at org.apache.hadoop.hdfs.server.datanode.BlockSender.doSendBlock(BlockSender.java:816)
        at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendBlock(BlockSender.java:763)
        at org.apache.hadoop.hdfs.server.datanode.DataNode$DataTransfer.run(DataNode.java:2900)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:750)

The NameNode log shows:

2024-08-26 10:39:13,404 INFO  BlockStateChange (DatanodeAdminManager.java:logBlockReplicationInfo(373)) - Block: blk_-9223372036823520640_4252147, Expected Replicas: 9, live replicas: 8, corrupt replicas: 0, decommissioned replicas: 0, decommissioning replicas: 1, maintenance replicas: 0, live entering maintenance replicas: 0, replicas on stale nodes: 0, readonly replicas: 0, excess replicas: 0, Is Open File: false, Datanodes having this block: 10.18.130.68:1019 10.18.130.52:1019 10.18.129.137:1019 10.18.130.65:1019 10.18.129.150:1019 10.18.130.58:1019 10.18.137.12:1019 10.18.130.251:1019 10.18.129.171:1019 , Current Datanode: 10.18.130.251:1019, Is current datanode decommissioning: true, Is current datanode entering maintenance: false
2024-08-26 10:39:13,404 INFO  blockmanagement.DatanodeAdminDefaultMonitor (DatanodeAdminDefaultMonitor.java:check(305)) - Node 10.18.130.251:1019 still has 3 blocks to replicate before it is a candidate to finish Decommission In Progress.
2024-08-26 10:39:13,404 INFO  blockmanagement.DatanodeAdminDefaultMonitor (DatanodeAdminDefaultMonitor.java:run(188)) - Checked 3 blocks and 1 nodes this tick. 1 nodes are now in maintenance or transitioning state. 0 nodes pending.
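A minimal, self-contained sketch (plain Java, not Hadoop code; the class and method names below are made up for illustration) of the livelock these logs suggest: each monitor tick re-schedules a transfer whose only source is the decommissioning node, the read from the failing disk always throws Input/output error, the failure is only logged, and the pending-block count never drops, so decommission never finishes.

```java
import java.io.IOException;

// Toy model of the hang: NOT Hadoop code, just the control flow implied
// by the DataNode and NameNode logs above.
public class DecomLivelock {

    // Stand-in for BlockSender reading from the disk with the media error:
    // every attempt fails, as in the stack traces.
    static void transferFromBadDisk() throws IOException {
        throw new IOException("Input/output error");
    }

    // Run the admin-monitor loop for a bounded number of ticks and return
    // how many blocks still need replication. In the real cluster there is
    // no tick limit, so the count never reaches zero.
    static int runMonitor(int ticks, int blocksToReplicate) {
        for (int t = 0; t < ticks; t++) {
            for (int b = 0; b < blocksToReplicate; b++) {
                try {
                    transferFromBadDisk(); // same failing source every tick
                    blocksToReplicate--;   // never reached
                } catch (IOException e) {
                    // The transfer thread only logs the WARN and gives up;
                    // the unreadable replica is not reported as corrupt,
                    // so the NameNode's view never changes.
                }
            }
        }
        return blocksToReplicate;
    }

    public static void main(String[] args) {
        // Mirrors the log line "still has 3 blocks to replicate" every tick.
        System.out.println(runMonitor(1000, 3));
    }
}
```

Under this reading, the fix would need either the DataNode to report the replica as corrupt on a persistent read error, or the monitor to stop re-selecting the failing node as the transfer source.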

The block (blk_-9223372036501683307_35436379) that the datanode is trying to read resides on the disk with the media error. dmesg keeps reporting:

[Mon Aug 26 10:41:28 2024] blk_update_request: I/O error, dev sdk, sector 12816298864 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
[Mon Aug 26 10:41:28 2024] sd 0:2:10:0: [sdk] tag#489 BRCM Debug mfi stat 0x2d, data len requested/completed 0x1000/0x0
[Mon Aug 26 10:41:28 2024] sd 0:2:10:0: [sdk] tag#491 BRCM Debug mfi stat 0x2d, data len requested/completed 0x1000/0x0
[Mon Aug 26 10:41:28 2024] sd 0:2:10:0: [sdk] tag#493 BRCM Debug mfi stat 0x2d, data len requested/completed 0x1000/0x0
[Mon Aug 26 10:41:28 2024] sd 0:2:10:0: [sdk] tag#493 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE cmd_age=0s
[Mon Aug 26 10:41:28 2024] sd 0:2:10:0: [sdk] tag#493 Sense Key : Medium Error [current]
[Mon Aug 26 10:41:28 2024] sd 0:2:10:0: [sdk] tag#493 Add. Sense: No additional sense information
[Mon Aug 26 10:41:28 2024] sd 0:2:10:0: [sdk] tag#493 CDB: Read(16) 88 00 00 00 00 03 06 09 e3 b0 00 00 00 08 00 00
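Note that with 8 of the 9 internal blocks of the striped group still live, the unreadable block is in principle reconstructable from the survivors: if the replica on the failed disk were reported corrupt rather than silently retried, EC reconstruction could regenerate it elsewhere and decommission could complete. A toy illustration of that recoverability using single XOR parity (HDFS EC actually uses Reed-Solomon; EcRecoveryDemo is a made-up name):

```java
// Toy erasure-coding demo: with one parity unit, the XOR of all surviving
// units equals the single lost unit. Real HDFS EC policies (e.g. RS-6-3)
// tolerate multiple losses via Reed-Solomon, but the principle is the same.
public class EcRecoveryDemo {

    // XOR the given units byte-by-byte.
    static byte[] xorAll(byte[][] units) {
        byte[] out = new byte[units[0].length];
        for (byte[] u : units)
            for (int i = 0; i < out.length; i++)
                out[i] ^= u[i];
        return out;
    }

    public static void main(String[] args) {
        byte[] d0 = {1, 2, 3}, d1 = {4, 5, 6}, d2 = {7, 8, 9};
        // Encode: parity = d0 ^ d1 ^ d2.
        byte[] parity = xorAll(new byte[][]{d0, d1, d2});
        // Pretend d1 sat on the failed disk; rebuild it from the survivors.
        byte[] rebuilt = xorAll(new byte[][]{d0, d2, parity});
        System.out.println(java.util.Arrays.equals(rebuilt, d1)); // true
    }
}
```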



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
