Zinan Zhuang created HDFS-17553:
-----------------------------------

             Summary: DFSOutputStream.java#closeImpl should have a retry upon flushInternal failures
                 Key: HDFS-17553
                 URL: https://issues.apache.org/jira/browse/HDFS-17553
             Project: Hadoop HDFS
          Issue Type: Improvement
          Components: dfsclient
    Affects Versions: 3.4.0, 3.3.1
            Reporter: Zinan Zhuang
[HDFS-15865|https://issues.apache.org/jira/browse/HDFS-15865] introduced an interrupt in the DataStreamer class that breaks out of the waitForAckedSeqno call once the ack timeout is exceeded. waitForAckedSeqno is called from [DFSOutputStream.java#flushInternal|https://github.com/apache/hadoop/blob/branch-3.0.0/hadoop-hdfs-project/hadoop-hdfs-client/src/main/java/org/apache/hadoop/hdfs/DFSOutputStream.java#L773], and one of flushInternal's callers is [DFSOutputStream.java#closeImpl|https://github.com/apache/hadoop/blob/branch-3.0.0/hadoop-hdfs-project/hadoop-hdfs-client/src/main/java/org/apache/hadoop/hdfs/DFSOutputStream.java#L870], which closes a file.

What we saw is that these interrupts occur more often during the flushInternal call while a file is being closed; the resulting exception is not handled by the DFSClient and is thrown to the caller. There is a known issue, [HDFS-4504|https://issues.apache.org/jira/browse/HDFS-4504], where a file that fails to close on the HDFS side leaks its lease until the DFSClient is recycled. In our HBase setups, DFSClients are long-lived within each regionserver, so these files stay undead until the regionserver is restarted. We observed this during datanode decommissioning, which got stuck on the open files left behind by the leak.

Since it is desirable to close an HDFS file as smoothly as possible, retrying flushInternal during closeImpl would help reduce such leakages.
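Below is a minimal sketch of the bounded-retry shape being proposed, not actual DFSOutputStream code. The FlushOp interface, the MAX_CLOSE_FLUSH_RETRIES constant, and the flushWithRetry helper are hypothetical names used only for illustration:

{code:java}
import java.io.IOException;

/**
 * Illustrative sketch only (not DFSOutputStream code): shows the shape of a
 * bounded retry around a flush operation, similar to what closeImpl could do
 * around flushInternal before treating the close as failed.
 */
public class CloseFlushRetrySketch {

  /** Stand-in for the flushInternal call, which can fail on an ack timeout. */
  @FunctionalInterface
  interface FlushOp {
    void flush() throws IOException;
  }

  /** Hypothetical bound; a real patch would need to choose or configure one. */
  private static final int MAX_CLOSE_FLUSH_RETRIES = 3;

  /**
   * Retry the flush a bounded number of times so that a single interrupted
   * waitForAckedSeqno does not immediately fail the close and leak the lease.
   */
  static void flushWithRetry(FlushOp op) throws IOException {
    IOException lastFailure = null;
    for (int attempt = 1; attempt <= MAX_CLOSE_FLUSH_RETRIES; attempt++) {
      try {
        op.flush();
        return; // flush acked; the rest of close can proceed
      } catch (IOException e) {
        // A real change would likely retry only timeout/interrupt-style
        // failures and rethrow fatal pipeline errors immediately.
        lastFailure = e;
      }
    }
    throw lastFailure;
  }

  /** Tiny demo: the first two "flushes" fail, the third succeeds. */
  public static void main(String[] args) throws IOException {
    int[] calls = {0};
    flushWithRetry(() -> {
      if (++calls[0] < 3) {
        throw new IOException("simulated ack timeout on attempt " + calls[0]);
      }
      System.out.println("flush succeeded on attempt " + calls[0]);
    });
  }
}
{code}

Keeping the retry bounded matters so that closeImpl still fails promptly when the pipeline is genuinely dead, rather than hanging the close path indefinitely.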