Jack Yang created HDFS-17608:
--------------------------------

             Summary: Datanode decommissioning hangs forever if the node under decommissioning has a disk media error
                 Key: HDFS-17608
                 URL: https://issues.apache.org/jira/browse/HDFS-17608
             Project: Hadoop HDFS
          Issue Type: Bug
          Components: datanode
    Affects Versions: 3.3.6
         Environment: Redhat 8.7, Hadoop 3.3.6
            Reporter: Jack Yang
         Attachments: image-2024-08-26-10-37-27-359.png
The blocks on the decommissioning datanode are all EC striped blocks. The decommissioning process hangs forever, and the datanode keeps emitting these logs:

2024-08-26 10:31:14,748 WARN datanode.DataNode (DataNode.java:run(2927)) - DatanodeRegistration(10.18.130.251:1019, datanodeUuid=a9e27f77-eb6e-46df-ad4c-b5daf2bf9508, infoPort=1022, infoSecurePort=0, ipcPort=8010, storageInfo=lv=-57;cid=CID-75a4da17-d28b-4820-b781-7c9f8dced67f;nsid=2079136093;c=1692354715862):Failed to transfer BP-184818459-10.18.130.160-1692354715862:blk_-9223372036501683307_35436379 to x.x.x.x:1019 got
java.io.IOException: Input/output error
        at java.io.FileInputStream.readBytes(Native Method)
        at java.io.FileInputStream.read(FileInputStream.java:255)
        at org.apache.hadoop.hdfs.server.datanode.FileIoProvider$WrappedFileInputStream.read(FileIoProvider.java:881)
        at java.io.FilterInputStream.read(FilterInputStream.java:133)
        at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
        at java.io.BufferedInputStream.read1(BufferedInputStream.java:286)
        at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
        at java.io.DataInputStream.read(DataInputStream.java:149)
        at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:215)
        at org.apache.hadoop.hdfs.server.datanode.fsdataset.ReplicaInputStreams.readChecksumFully(ReplicaInputStreams.java:90)
        at org.apache.hadoop.hdfs.server.datanode.BlockSender.readChecksum(BlockSender.java:691)
        at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendPacket(BlockSender.java:578)
        at org.apache.hadoop.hdfs.server.datanode.BlockSender.doSendBlock(BlockSender.java:816)
        at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendBlock(BlockSender.java:763)
        at org.apache.hadoop.hdfs.server.datanode.DataNode$DataTransfer.run(DataNode.java:2900)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:750)

2024-08-26 10:31:14,758 WARN datanode.DataNode (BlockSender.java:readChecksum(693)) - Could not read or failed to verify checksum for data at offset 10878976 for block BP-184818459-x.x.x.x-1692354715862:blk_-9223372036827280880_3990731
java.io.IOException: Input/output error
        at java.io.FileInputStream.readBytes(Native Method)
        at java.io.FileInputStream.read(FileInputStream.java:255)
        at org.apache.hadoop.hdfs.server.datanode.FileIoProvider$WrappedFileInputStream.read(FileIoProvider.java:881)
        at java.io.FilterInputStream.read(FilterInputStream.java:133)
        at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
        at java.io.BufferedInputStream.read1(BufferedInputStream.java:286)
        at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
        at java.io.DataInputStream.read(DataInputStream.java:149)
        at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:215)
        at org.apache.hadoop.hdfs.server.datanode.fsdataset.ReplicaInputStreams.readChecksumFully(ReplicaInputStreams.java:90)
        at org.apache.hadoop.hdfs.server.datanode.BlockSender.readChecksum(BlockSender.java:691)
        at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendPacket(BlockSender.java:578)
        at org.apache.hadoop.hdfs.server.datanode.BlockSender.doSendBlock(BlockSender.java:816)
        at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendBlock(BlockSender.java:763)
        at org.apache.hadoop.hdfs.server.datanode.DataNode$DataTransfer.run(DataNode.java:2900)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:750)
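Taken together, the traces show a retry cycle that can never make progress: every transfer command issued to the decommissioning node re-runs the same local read over the bad sector (ReplicaInputStreams.readChecksumFully via BlockSender.sendBlock), fails with an I/O error, and is only logged at WARN level, so the next heartbeat schedules the identical transfer again. The following is a minimal, self-contained Java sketch of that cycle as I read the logs; it is not the actual DataNode code, and the class name and method bodies are invented stand-ins for illustration:

    import java.io.IOException;

    /**
     * Illustrative sketch only: a stand-in for the DataTransfer ->
     * BlockSender.sendBlock -> ReplicaInputStreams.readChecksumFully
     * path seen in the stack traces above, not the real Hadoop code.
     */
    public class SimulatedDataTransfer {

        /** Stand-in for the checksum read that hits the bad sector. */
        static void readChecksumFully() throws IOException {
            // A disk medium error makes this fail on every attempt.
            throw new IOException("Input/output error");
        }

        /** Stand-in for BlockSender.sendBlock: checksum first, then data. */
        static void sendBlock(String blockId) throws IOException {
            readChecksumFully();  // always throws on the failing disk
            // ... streaming packets to the target never even starts ...
        }

        public static void main(String[] args) {
            String blockId = "blk_-9223372036501683307_35436379";
            // Each NameNode heartbeat re-issues the same transfer command;
            // three iterations stand in for an unbounded retry cycle.
            for (int attempt = 1; attempt <= 3; attempt++) {
                try {
                    sendBlock(blockId);
                } catch (IOException e) {
                    // Mirrors the WARN above: the failure is logged but not
                    // escalated, so nothing ever breaks the cycle.
                    System.out.println("WARN Failed to transfer " + blockId
                            + " got " + e);
                }
            }
        }
    }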
The namenode outputs:

2024-08-26 10:39:13,404 INFO BlockStateChange (DatanodeAdminManager.java:logBlockReplicationInfo(373)) - Block: blk_-9223372036823520640_4252147, Expected Replicas: 9, live replicas: 8, corrupt replicas: 0, decommissioned replicas: 0, decommissioning replicas: 1, maintenance replicas: 0, live entering maintenance replicas: 0, replicas on stale nodes: 0, readonly replicas: 0, excess replicas: 0, Is Open File: false, Datanodes having this block: 10.18.130.68:1019 10.18.130.52:1019 10.18.129.137:1019 10.18.130.65:1019 10.18.129.150:1019 10.18.130.58:1019 10.18.137.12:1019 10.18.130.251:1019 10.18.129.171:1019 , Current Datanode: 10.18.130.251:1019, Is current datanode decommissioning: true, Is current datanode entering maintenance: false
2024-08-26 10:39:13,404 INFO blockmanagement.DatanodeAdminDefaultMonitor (DatanodeAdminDefaultMonitor.java:check(305)) - Node 10.18.130.251:1019 still has 3 blocks to replicate before it is a candidate to finish Decommission In Progress.
2024-08-26 10:39:13,404 INFO blockmanagement.DatanodeAdminDefaultMonitor (DatanodeAdminDefaultMonitor.java:run(188)) - Checked 3 blocks and 1 nodes this tick. 1 nodes are now in maintenance or transitioning state. 0 nodes pending.

The block (blk_-9223372036501683307_35436379) that the datanode is trying to access is on a disk with a media error. dmesg keeps reporting:

[Mon Aug 26 10:41:28 2024] blk_update_request: I/O error, dev sdk, sector 12816298864 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
[Mon Aug 26 10:41:28 2024] sd 0:2:10:0: [sdk] tag#489 BRCM Debug mfi stat 0x2d, data len requested/completed 0x1000/0x0
[Mon Aug 26 10:41:28 2024] sd 0:2:10:0: [sdk] tag#491 BRCM Debug mfi stat 0x2d, data len requested/completed 0x1000/0x0
[Mon Aug 26 10:41:28 2024] sd 0:2:10:0: [sdk] tag#493 BRCM Debug mfi stat 0x2d, data len requested/completed 0x1000/0x0
[Mon Aug 26 10:41:28 2024] sd 0:2:10:0: [sdk] tag#493 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE cmd_age=0s
[Mon Aug 26 10:41:28 2024] sd 0:2:10:0: [sdk] tag#493 Sense Key : Medium Error [current]
[Mon Aug 26 10:41:28 2024] sd 0:2:10:0: [sdk] tag#493 Add. Sense: No additional sense information
[Mon Aug 26 10:41:28 2024] sd 0:2:10:0: [sdk] tag#493 CDB: Read(16) 88 00 00 00 00 03 06 09 e3 b0 00 00 00 08 00 00
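For context on why the monitor never converges: the node can leave Decommission In Progress only once every block it holds is sufficiently replicated elsewhere, and each failed transfer leaves the pending count unchanged. Below is a minimal, simplified sketch of that per-tick check as I read the DatanodeAdminDefaultMonitor messages above; the class and method names are invented for the example and this is not the real NameNode code:

    import java.util.ArrayList;
    import java.util.List;

    /**
     * Illustrative sketch only: a simplified stand-in for the per-tick
     * decommission check reported by DatanodeAdminDefaultMonitor above.
     */
    public class SimulatedDecommissionMonitor {

        /** Stand-in: copying a block off the bad disk always fails (EIO). */
        static boolean tryReplicateElsewhere(String blockId) {
            return false;
        }

        public static void main(String[] args) {
            // The three blocks the monitor reports as still pending.
            List<String> pending = new ArrayList<>(List.of(
                    "blk_-9223372036823520640_4252147",
                    "blk_-9223372036501683307_35436379",
                    "blk_-9223372036827280880_3990731"));

            // The real monitor re-checks on every scan interval; three
            // ticks stand in for an unbounded loop here.
            for (int tick = 1; tick <= 3 && !pending.isEmpty(); tick++) {
                pending.removeIf(
                        SimulatedDecommissionMonitor::tryReplicateElsewhere);
                System.out.println("INFO Node 10.18.130.251:1019 still has "
                        + pending.size() + " blocks to replicate before it"
                        + " is a candidate to finish Decommission In"
                        + " Progress.");
            }
            // pending never shrinks, so decommissioning hangs forever.
        }
    }

Note that with Expected Replicas: 9 and 8 live, the striped group (presumably an RS-6-3 policy) still has enough internal blocks for reconstruction; the hang appears to happen because the monitor keeps waiting on the one unreadable source instead of letting EC reconstruction from the surviving internal blocks satisfy the replication requirement.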