[ https://issues.apache.org/jira/browse/HDFS-17808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18016704#comment-18016704 ]
ASF GitHub Bot commented on HDFS-17808:
---------------------------------------

hfutatzhanghb commented on PR #7810:
URL: https://github.com/apache/hadoop/pull/7810#issuecomment-3232146220

@Hexiaoqiao Thanks very much for reviewing. Please allow me to define the issue clearly here.

Recently, while exploring the use of HDFS Erasure Coding (EC) for hot-data storage, we encountered some problems, and the current issue is one of them.

**Problem description (pseudo-code):**

```java
DFSStripedOutputStream os = dfs.create(path);
// The task may run for several hours, so the output-stream object is also held open for hours.
while (task is not finished) {
    data = doSomeComputeLogicAndGetData();
    os.write(data);
}
os.close();
```

When we perform a rolling restart of DataNodes, the above task fails. The root cause is that, during writing, an EC output stream excludes any bad DataNode from the pipeline, but there is no mechanism to add new DataNodes to replace the excluded ones. Once more than three DataNodes have been excluded, the output stream no longer has enough DataStreamers to continue writing and therefore aborts.

So this PR tries to resolve the problem by ending the block group in advance when failed streamers are detected (while their count is still <= 3) and allocating a new block; after the new block is allocated, we again have a sufficient number of healthy data streamers.

> EC: End block group in advance to prevent write failure for long-time running
> OutputStream
> ------------------------------------------------------------------------------------------
>
>                 Key: HDFS-17808
>                 URL: https://issues.apache.org/jira/browse/HDFS-17808
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: ec, erasure-coding
>            Reporter: farmmamba
>            Assignee: farmmamba
>            Priority: Major
>              Labels: pull-request-available
>
> Recently, we met an EC problem in our production.
> A user creates an output stream to write EC files. The output stream writes
> some bytes and then stays idle for a long time until data is ready.
> If we restart our cluster's DataNodes for a version upgrade, those applications
> will eventually fail because they no longer have enough healthy streamers.
>
> This Jira tries to solve the above problem by ending the block group in advance
> when we already have failed streamers, but fewer than the parity number.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
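To make the proposed behavior concrete, here is a minimal, self-contained sketch of the decision logic being described, assuming the default RS-6-3 layout (6 data + 3 parity streamers). The class and method names are hypothetical, not the actual `DFSStripedOutputStream` code: while the number of failed streamers is nonzero but still at most the parity count, the stream can end the current block group early and allocate a fresh one, restoring a full set of healthy streamers; once failures exceed the parity count, the write can only abort.

```java
// Hypothetical sketch of the "end block group in advance" decision for an
// RS-6-3 EC layout. Names are illustrative, not actual Hadoop APIs.
public class EndBlockGroupEarlySketch {
    static final int NUM_DATA = 6;
    static final int NUM_PARITY = 3;
    static final int TOTAL = NUM_DATA + NUM_PARITY;

    /**
     * The block group can still be finished while failedStreamers <= NUM_PARITY,
     * but without a way to replace excluded DataNodes, every further failure
     * moves the stream closer to aborting. So the proposal is to end the group
     * as soon as any streamer has failed, while recovery is still possible.
     */
    static boolean shouldEndBlockGroupEarly(int failedStreamers) {
        return failedStreamers > 0 && failedStreamers <= NUM_PARITY;
    }

    /** Once failures exceed the parity count, the write can only abort. */
    static boolean mustAbort(int failedStreamers) {
        return failedStreamers > NUM_PARITY;
    }

    public static void main(String[] args) {
        // Simulate streamers being excluded one by one during a rolling restart.
        for (int failed = 0; failed <= NUM_PARITY + 1; failed++) {
            String action = mustAbort(failed)
                    ? "abort (current behavior)"
                    : shouldEndBlockGroupEarly(failed)
                        ? "end block group early, allocate a new one (" + TOTAL + " fresh streamers)"
                        : "keep writing";
            System.out.println(failed + " failed streamer(s) -> " + action);
        }
    }
}
```

With this policy, a rolling DataNode restart that gradually excludes streamers never reaches the abort threshold, because each early block-group rotation resets the stream to a full set of healthy streamers.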