Harsh J created HDFS-4246:
-----------------------------

             Summary: The exclude node list should be more forgiving, for each 
output stream
                 Key: HDFS-4246
                 URL: https://issues.apache.org/jira/browse/HDFS-4246
             Project: Hadoop HDFS
          Issue Type: Improvement
          Components: hdfs-client
            Reporter: Harsh J
            Priority: Minor


Originally observed by Inder on the mailing lists:

{quote}
Folks,

i was wondering if there is any mechanism/logic to move a node back from the 
excludedNodeList to live nodes to be tried for new block creation.

In the current DFSOutputStream code i do not see this. The use-case is if the 
write timeout is being reduced and certain nodes get aggressively added to the 
excludedNodeList and the client caches DFSOutputStream then the excludedNodes 
never get tried again in the lifetime of the application caching DFSOutputStream
{quote}

This leads to a particular failure scenario that may affect smaller clusters 
more than larger ones:

1. A file is opened for continuous hflush/sync-based writes, such as an HBase 
WAL. By design, this file is kept open for a very long time.
2. Over time, nodes are excluded for various errors, such as DN crashes, 
network failures, etc.
3. Eventually, the exclude list equals, or nearly equals, the list of live 
nodes, and the write suffers. Once the two are equal, the write fails with an 
error about not being able to get a block allocation.

We should perhaps make the excludeNodes list a timed-cache collection, so that 
even as it begins filling up, older excludes are pruned away and those nodes 
get another try later.
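
A minimal sketch of what such a timed-cache exclude list could look like, to 
make the idea concrete. This is not the existing DFSOutputStream code: the 
class name TimedExcludeList, its methods, the use of plain strings as node 
identifiers, and the injectable clock are all assumptions made for 
illustration.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.LongSupplier;

/** Hypothetical sketch only; names and types are not from the HDFS client. */
final class TimedExcludeList {
    // Node identifier -> time (in millis) at which the exclusion expires.
    private final Map<String, Long> expiryByNode = new ConcurrentHashMap<>();
    private final long forgiveAfterMillis;
    private final LongSupplier clock; // injectable so expiry can be tested

    TimedExcludeList(long forgiveAfterMillis, LongSupplier clock) {
        this.forgiveAfterMillis = forgiveAfterMillis;
        this.clock = clock;
    }

    /** Exclude a node for a fixed period, measured from now. */
    void exclude(String node) {
        expiryByNode.put(node, clock.getAsLong() + forgiveAfterMillis);
    }

    /** Prune expired entries, then return a snapshot of what is still excluded. */
    Set<String> currentExcludes() {
        long now = clock.getAsLong();
        expiryByNode.values().removeIf(expiry -> expiry <= now);
        return new HashSet<>(expiryByNode.keySet());
    }
}
```

With something like `new TimedExcludeList(600_000L, System::currentTimeMillis)`, 
a node excluded after a write error would become eligible again ten minutes 
later; the ten-minute figure is arbitrary and would presumably need to be 
configurable.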

One place we have to be careful about, though, is rack failures. Racks 
sometimes do not come back fast enough, and an eventually-forgiving list can 
make the retry code problematic. Perhaps we can remember forgiven nodes and, 
if they are excluded again, double or triple their forgiveness period (in 
time units) to counter this? It's just one idea.
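
The doubling idea could be sketched roughly as below. Again, this is only an 
illustration under assumptions: BackoffExcludeList, its fields, the cap on the 
window, and the injectable clock are hypothetical and not part of the HDFS 
client.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.LongSupplier;

/** Hypothetical sketch only; names and types are not from the HDFS client. */
final class BackoffExcludeList {
    private static final class Entry {
        long expiry; // time (millis) at which the node is forgiven
        long window; // forgiveness period used for the latest exclusion
    }

    // Entries are kept after they expire so a repeat offender can be recognised.
    private final Map<String, Entry> entries = new HashMap<>();
    private final long baseMillis;
    private final long maxMillis;
    private final LongSupplier clock;

    BackoffExcludeList(long baseMillis, long maxMillis, LongSupplier clock) {
        this.baseMillis = baseMillis;
        this.maxMillis = maxMillis;
        this.clock = clock;
    }

    /** First exclusion uses the base window; each later one doubles it, up to the cap. */
    synchronized void exclude(String node) {
        Entry e = entries.get(node);
        if (e == null) {
            e = new Entry();
            e.window = baseMillis;
            entries.put(node, e);
        } else {
            e.window = Math.min(e.window * 2, maxMillis);
        }
        e.expiry = clock.getAsLong() + e.window;
    }

    /** A node is excluded only until its current window has elapsed. */
    synchronized boolean isExcluded(String node) {
        Entry e = entries.get(node);
        return e != null && clock.getAsLong() < e.expiry;
    }
}
```

This sketch never drops history, so a node that fails once early on keeps its 
enlarged window forever; a real change would likely also need a rule for 
resetting the window after a long period of good behaviour.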

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
