[ 
https://issues.apache.org/jira/browse/HDFS-17808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18016704#comment-18016704
 ] 

ASF GitHub Bot commented on HDFS-17808:
---------------------------------------

hfutatzhanghb commented on PR #7810:
URL: https://github.com/apache/hadoop/pull/7810#issuecomment-3232146220

   @Hexiaoqiao Thanks very much for reviewing. Please allow me to define the 
issue clearly here.
   
   Recently, while exploring the use of HDFS Erasure Coding (EC) for hot-data 
storage, we encountered some problems and the current issue is one of them.
   
   **Problem description (pseudo-code):**
   
   ```java
   DFSStripedOutputStream os = dfs.create(path);
   // The task may run for several hours, so the output-stream object is also
   // held open for hours.
   while (task is not finished) {
       data = doSomeComputeLogicAndGetData();
       os.write(data);
   }
   os.close();
   ```
   
   When we perform a rolling restart of DataNodes, the above task fails.
   The root cause is that, during writing, an EC output stream will exclude any 
bad DataNode from the pipeline, but there is no mechanism to add new DataNodes 
to replace the excluded ones. Once more than three DataNodes have been 
excluded, the output stream no longer has enough DataStreamers to continue 
writing and therefore aborts.
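
   To make the failure threshold concrete, here is a minimal sketch that models 
   only the bookkeeping described above. It does not use any real HDFS class: 
   `StripedWriteFailureSketch`, `restartsUntilAbort`, and the RS-6-3 layout 
   (6 data units, 3 parity units) are assumptions made for illustration.

   ```java
   // Hypothetical model, not HDFS code: one streamer per block in the group,
   // and a failed streamer is excluded for good (no replacement DataNode).
   final class StripedWriteFailureSketch {

       /**
        * Replays a sequence of DataNode restarts, each identified by the index
        * of the streamer it breaks. Returns the 1-based position of the restart
        * that leaves fewer healthy streamers than data units (the write aborts),
        * or -1 if the write survives the whole sequence.
        */
       static int restartsUntilAbort(int dataUnits, int parityUnits,
                                     int[] restartedStreamers) {
           boolean[] failed = new boolean[dataUnits + parityUnits];
           int healthy = failed.length;
           for (int i = 0; i < restartedStreamers.length; i++) {
               int streamer = restartedStreamers[i];
               if (!failed[streamer]) {   // an already excluded streamer is not counted twice
                   failed[streamer] = true;
                   healthy--;
               }
               if (healthy < dataUnits) { // more failures than parity units
                   return i + 1;
               }
           }
           return -1;
       }

       public static void main(String[] args) {
           // Assumed RS-6-3 layout; streamers 0..4 are restarted one by one.
           System.out.println(restartsUntilAbort(6, 3, new int[] {0, 1, 2, 3, 4}));
       }
   }
   ```

   Under these assumptions the first three restarts are tolerated and the 
   fourth one aborts the write, no matter how much time passes between them.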
   
   So this PR tries to resolve the problem by ending the current block group 
   early when some streamers have failed (failed-streamer count <= 3) and 
   allocating a new block. After the new block is allocated, we will again 
   have enough healthy data streamers.
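
   The idea can be sketched as a small decision rule plus a simulation. This is 
   not the code in the PR: `EarlyBlockGroupEndSketch`, 
   `shouldEndBlockGroupEarly`, and `survives` are hypothetical names. The exact 
   trigger condition and the assumption that a new allocation always yields a 
   full set of healthy streamers are simplifications.

   ```java
   // Hypothetical model of "end the block group in advance"; not HDFS code.
   final class EarlyBlockGroupEndSketch {

       /** Assumed rule: end early once any streamer has failed but the write is still viable. */
       static boolean shouldEndBlockGroupEarly(int failedStreamers, int parityUnits) {
           return failedStreamers > 0 && failedStreamers <= parityUnits;
       }

       /**
        * Simulates `restarts` DataNode restarts, each breaking one streamer of
        * the current block group, with one write attempt after each restart.
        * Returns true if the stream never exceeds the tolerated failure count.
        */
       static boolean survives(int parityUnits, int restarts, boolean endEarly) {
           int failed = 0;
           for (int i = 0; i < restarts; i++) {
               failed++;
               if (failed > parityUnits) {
                   return false;          // not enough healthy streamers: abort
               }
               if (endEarly && shouldEndBlockGroupEarly(failed, parityUnits)) {
                   failed = 0;            // assumed: new allocation restores all streamers
               }
           }
           return true;
       }

       public static void main(String[] args) {
           System.out.println(survives(3, 10, false));
           System.out.println(survives(3, 10, true));
       }
   }
   ```

   In this model the failure count is reset at every new allocation, so a 
   rolling restart that touches many DataNodes over several hours no longer 
   accumulates into a fatal number of excluded streamers.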




> EC: End block group in advance to prevent write failure for long-time running 
> OutputStream
> ------------------------------------------------------------------------------------------
>
>                 Key: HDFS-17808
>                 URL: https://issues.apache.org/jira/browse/HDFS-17808
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: ec, erasure-coding
>            Reporter: farmmamba
>            Assignee: farmmamba
>            Priority: Major
>              Labels: pull-request-available
>
> Recently, we hit an EC problem in production.
> A user creates an output stream to write EC files.  The output stream writes 
> some bytes and then stays idle for a long time until more data is ready.  If 
> we restart the cluster's DataNodes to upgrade them, those applications 
> eventually fail because there are not enough healthy streamers.
>  
> This Jira tries to solve the above problem by ending the block group early 
> when some streamers have already failed but the number of failures is still 
> less than the number of parity blocks.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
