Re: meaning of LEASE_RECOVER_PERIOD

2010-07-26 Thread Todd Lipcon
Hi Thanh,

It's a bit of a hack - basically, this is the interval that determines how
often lease recovery can be re-initiated in the NN. The assumption is that
if a recovery attempt has not completed within LEASE_RECOVER_PERIOD, it can
be retried using a different DN as the "primary" (i.e., the recovery
coordinator).
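
For reference, the check on the NN side is roughly like the sketch below. This
is a paraphrase rather than the verbatim HDFS source; only the constant and
method names mentioned in this thread are taken from the code.

// Paraphrased sketch (not the verbatim HDFS source): the NN remembers when it
// last initiated recovery for a file under construction and only re-initiates
// (possibly with a different primary DN) once LEASE_RECOVER_PERIOD has elapsed
// without the previous attempt completing.
class FileUnderConstructionSketch {
  static final long LEASE_RECOVER_PERIOD = 10 * 1000; // 10 seconds, in ms

  private long lastRecoveryTime = 0;

  /** @return true if enough time has elapsed to (re)initiate lease recovery. */
  boolean setLastRecoveryTime(long now) {
    boolean expired = now - lastRecoveryTime > LEASE_RECOVER_PERIOD;
    if (expired) {
      lastRecoveryTime = now; // a new recovery attempt is being started
    }
    return expired;
  }
}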

-Todd

On Sat, Jul 24, 2010 at 7:00 PM, Thanh Do  wrote:

> Hi,
>
> I look at FSConstants and see this
>
> LEASE_RECOVER_PERIOD = 10 * 1000; // i.e 10 seconds
>
> and the only place this is used is in:
>
> INodeFileUnderConstruction.setLastRecoveryTime()
>
> Can anyone explain to me the intuition behind this?
> Why is this value fixed at 10 seconds?
>
> Thanks
> --
> thanh
>



-- 
Todd Lipcon
Software Engineer, Cloudera


[jira] Created: (HDFS-1319) Fix location of re-login for secondary namenode from HDFS-999

2010-07-26 Thread Jakob Homan (JIRA)
Fix location of re-login for secondary namenode from HDFS-999
--------------------------------------------------------------

 Key: HDFS-1319
 URL: https://issues.apache.org/jira/browse/HDFS-1319
 Project: Hadoop HDFS
  Issue Type: Sub-task
  Components: name-node
Affects Versions: 0.22.0
Reporter: Jakob Homan
Assignee: Jakob Homan
 Fix For: 0.22.0


A bugfix to the original patch (HDFS-999) hasn't been applied to trunk yet, 
causing the secondary namenode (2NN) not to be logged in at the correct time 
when it makes an RPC call. This JIRA is to forward-port the bugfix.
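
For context, the pattern involved is to refresh the Kerberos login from the
keytab immediately before the RPC is made, for example via
UserGroupInformation#checkTGTAndReloginFromKeytab(). The sketch below is
illustrative only and is not the actual SecondaryNameNode code;
doCheckpointRpc() is a hypothetical placeholder.

import java.io.IOException;

import org.apache.hadoop.security.UserGroupInformation;

public class ReloginBeforeRpcSketch {
  /**
   * Illustrative only: re-login from the keytab (if needed) right before an
   * RPC to the namenode so the call is made with fresh credentials.
   */
  static void callNamenodeWithFreshLogin() throws IOException {
    UserGroupInformation ugi = UserGroupInformation.getCurrentUser();
    ugi.checkTGTAndReloginFromKeytab(); // no-op unless logged in from a keytab
    doCheckpointRpc();                  // hypothetical namenode RPC
  }

  private static void doCheckpointRpc() throws IOException {
    // placeholder for the real RPC (e.g. asking the NN to roll edit logs)
  }
}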

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (HDFS-1320) Add LOG.isDebugEnabled() guard for each LOG.debug("...")

2010-07-26 Thread Erik Steffl (JIRA)
Add LOG.isDebugEnabled() guard for each LOG.debug("...")


 Key: HDFS-1320
 URL: https://issues.apache.org/jira/browse/HDFS-1320
 Project: Hadoop HDFS
  Issue Type: Improvement
Affects Versions: 0.22.0
Reporter: Erik Steffl
 Fix For: 0.22.0


Each LOG.debug("...") should be executed only if LOG.isDebugEnabled() is true, 
because in some cases it is expensive to construct the string that is being 
logged. It is simpler to always add the LOG.isDebugEnabled() guard than to 
reason case by case about whether it is necessary.
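
For illustration, the proposed pattern looks like the following. The class and
variable names here are made up for the example; only the commons-logging calls
are real.

import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;

public class DebugGuardExample {
  private static final Log LOG = LogFactory.getLog(DebugGuardExample.class);

  void processBlock(String blockId, long length) {
    // Unguarded: the argument string is concatenated on every call, even
    // when DEBUG logging is disabled.
    // LOG.debug("Processed block " + blockId + " of length " + length);

    // Guarded: the string is only built when DEBUG is actually enabled.
    if (LOG.isDebugEnabled()) {
      LOG.debug("Processed block " + blockId + " of length " + length);
    }
  }
}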

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Resolved: (HDFS-1229) DFSClient incorrectly asks for new block if primary crashes during first recoverBlock

2010-07-26 Thread Todd Lipcon (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-1229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Lipcon resolved HDFS-1229.
---

Resolution: Cannot Reproduce

Thanks for checking against the append branch; marking as resolved since the 
bug is fixed.

> DFSClient incorrectly asks for new block if primary crashes during first 
> recoverBlock
> -------------------------------------------------------------------------
>
> Key: HDFS-1229
> URL: https://issues.apache.org/jira/browse/HDFS-1229
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: hdfs client
>Affects Versions: 0.20-append
>Reporter: Thanh Do
>
> Setup:
> 
> + # available datanodes = 2
> + # disks / datanode = 1
> + # failures = 1
> + failure type = crash
> + When/where failure happens = during primary's recoverBlock
>  
> Details:
> --
> Say the client is appending to block X1 on 2 datanodes: dn1 and dn2.
> First it needs to make sure both dn1 and dn2 agree on the new generation
> stamp (GS) of the block.
> 1) The client first creates a DFSOutputStream by calling
>
> > OutputStream result = new DFSOutputStream(src, buffersize, progress,
> >     lastBlock, stat, conf.getInt("io.bytes.per.checksum", 512));
>
> in DFSClient.append().
>  
> 2) The above DFSOutputStream constructor in turn calls
> processDatanodeError(true, true)
> (i.e., hasError = true, isAppend = true), and starts the DataStreamer:
>
> > processDatanodeError(true, true);  /* let's call this PDNE 1 */
> > streamer.start();
>
> Note that DataStreamer.run() also calls processDatanodeError():
> > while (!closed && clientRunning) {
> >   ...
> >   boolean doSleep = processDatanodeError(hasError, false); /* let's call this PDNE 2 */
>  
> 3) Now in PDNE 1, we have the following code:
>
> > blockStream = null;
> > blockReplyStream = null;
> > ...
> > while (!success && clientRunning) {
> >   ...
> >   try {
> >     primary = createClientDatanodeProtocolProxy(primaryNode, conf);
> >     newBlock = primary.recoverBlock(block, isAppend, newnodes); /* exception here */
> >     ...
> >   } catch (IOException e) {
> >     ...
> >     if (recoveryErrorCount > maxRecoveryErrorCount) {
> >       // this condition is false
> >     }
> >     ...
> >     return true;
> >   } // end catch
> >   finally {...}
> >
> >   this.hasError = false;
> >   lastException = null;
> >   errorIndex = 0;
> >   success = createBlockOutputStream(nodes, clientName, true);
> > }
> > ...
>  
> Because dn1 crashes during the client's call to recoverBlock, we get an exception
> and go to the catch block, where processDatanodeError returns true before
> hasError is set to false. Also, because createBlockOutputStream() is never called
> (due to the early return), blockStream is still null.
>  
> 4) Now that PDNE 1 has finished, we come to streamer.start(), which calls PDNE 2.
> Because hasError = false, PDNE 2 returns false immediately without doing anything:
> > if (!hasError) { return false; }
>  
> 5) Still in DataStreamer.run(), after PDNE 2 returns false, we still have
> blockStream = null, hence the following code is executed:
> > if (blockStream == null) {
> >   nodes = nextBlockOutputStream(src);
> >   this.setName("DataStreamer for file " + src + " block " + block);
> >   response = new ResponseProcessor(nodes);
> >   response.start();
> > }
>  
> nextBlockOutputStream(), which asks the namenode to allocate a new block, is called.
> (This is not good, because we are appending, not writing.)
> The namenode gives it a new block ID and a set of datanodes, including the crashed dn1.
> This makes createBlockOutputStream() fail, because it tries to contact dn1
> (which has crashed) first. The client retries 5 times without any success,
> because every time it asks the namenode for a new block! Again we see
> that the retry logic at the client is flawed. (A simplified sketch of this
> control flow follows the quoted report below.)
> *This bug was found by our Failure Testing Service framework:
> http://www.eecs.berkeley.edu/Pubs/TechRpts/2010/EECS-2010-98.html
> For questions, please email us: Thanh Do (than...@cs.wisc.edu) and 
> Haryadi Gunawi (hary...@eecs.berkeley.edu)*
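
To make the chain of events above easier to follow, here is a simplified,
self-contained sketch of the control flow described in steps 2-5. It is not
the actual DFSClient code; the names and signatures are stand-ins and the
real methods are much more involved.

// Simplified stand-in for the 0.20-append DFSClient flow described above.
class AppendRecoveryFlowSketch {

  private volatile boolean hasError = false; // the field the streamer loop checks
  private Object blockStream = null;         // set only when createBlockOutputStream succeeds

  /** Step 2: the append constructor calls PDNE 1 and then starts the streamer. */
  void appendSetup() {
    processDatanodeError(true, true); // PDNE 1: hits the exception and returns early
    streamerIteration();              // stands in for streamer.start() / DataStreamer.run()
  }

  /** Simplified processDatanodeError ("PDNE"). */
  boolean processDatanodeError(boolean hasError, boolean isAppend) {
    if (!hasError) {
      return false;                   // step 4: PDNE 2 bails out immediately
    }
    try {
      recoverBlock();                 // step 3: primary dn1 has crashed, so this throws
      // never reached on this path:
      this.hasError = false;
      blockStream = createBlockOutputStream();
      return false;
    } catch (Exception e) {
      return true;                    // early return: hasError and blockStream left untouched
    }
  }

  /** One iteration of the simplified DataStreamer loop. */
  void streamerIteration() {
    processDatanodeError(hasError, false); // PDNE 2: the field is false, so it does nothing
    if (blockStream == null) {
      nextBlockOutputStream();             // step 5: asks the NN for a brand-new block,
    }                                      // even though this is an append
  }

  private void recoverBlock() throws Exception { throw new Exception("dn1 crashed"); }
  private Object createBlockOutputStream() { return new Object(); }
  private void nextBlockOutputStream() { /* allocates a fresh block from the NN */ }
}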

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.