Re: profiling hdfs write path

2012-12-05 Thread Steve Loughran
On 5 December 2012 02:00, Radim Kolar  wrote:

>> Agree.  Want to write some?
>
> It's not about writing patches, it's about getting them committed. In my
> experience, getting something committed takes months, even for a simple
> patch. I have about 10 patches floating around; none of them has been
> committed in the last 4 weeks. They are really simple stuff. I haven't
> tried anything more elaborate, because the Bible says: if you fail at an
> easy thing, you will fail at a hard thing too.
>
>
There is inertia; nobody is happy with it, but that's the price of having
something that's designed to keep PB of data safe.



> I am thinking day by day that I really need to fork Hadoop, otherwise there
> is no way to move it forward to where I need it to be.
>

A lot of the early Hadoop projects chose this path. Once you get out of
sync with the Apache code you have two problems:
 -keeping your branch up to date with all the fixes and features you want.
 -testing.


[jira] [Created] (HDFS-4270) Replications of the highest priority should be allowed to choose a source datanode that has reached its max replication limit

2012-12-05 Thread Derek Dagit (JIRA)
Derek Dagit created HDFS-4270:
-

 Summary: Replications of the highest priority should be allowed to 
choose a source datanode that has reached its max replication limit
 Key: HDFS-4270
 URL: https://issues.apache.org/jira/browse/HDFS-4270
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: namenode
Affects Versions: 0.23.5, 3.0.0
Reporter: Derek Dagit
Assignee: Derek Dagit
Priority: Minor
 Fix For: 3.0.0, 2.0.3-alpha, 0.23.6


Blocks that have been identified as under-replicated are placed on one of 
several priority queues.  The highest priority queue is essentially reserved 
for situations in which only one replica of the block exists, meaning it should 
be replicated ASAP.

The ReplicationMonitor periodically computes replication work, and a call to 
BlockManager#chooseUnderReplicatedBlocks selects a given number of 
under-replicated blocks, choosing blocks from the highest-priority queue first 
and working down to the lowest priority queue.

In the subsequent call to BlockManager#computeReplicationWorkForBlocks, a 
source for the replication is chosen from among datanodes that have an 
available copy of the block needed.  This is done in 
BlockManager#chooseSourceDatanode.


chooseSourceDatanode's job is to choose the datanode for replication.  It 
chooses a random datanode from the available datanodes that has not reached its 
replication limit (preferring datanodes that are currently decommissioning).

However, the priority queue of the block does not inform the logic.  If a 
datanode holds the last remaining replica of a block and has already reached 
its replication limit, the node is dismissed outright and the replication is 
not scheduled.

In some situations, this could lead to data loss, as the last remaining replica 
could disappear if an opportunity is not taken to schedule a replication.  It 
would be better to waive the max replication limit in cases of highest-priority 
block replication.
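
A minimal, self-contained sketch of the proposed decision (the method and
parameter names are illustrative assumptions, not the real BlockManager API):

{code}
// Hypothetical, simplified decision logic: a datanode at its replication
// limit is still usable as a replication source when the block sits on
// the highest-priority (last-replica) queue.
static boolean isUsableSource(int scheduledReplications,
                              int maxReplicationStreams,
                              boolean highestPriorityBlock) {
  if (scheduledReplications < maxReplicationStreams) {
    return true;                  // under the limit: always usable
  }
  return highestPriorityBlock;    // at the limit: waive only for last replicas
}
{code}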


--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Created] (HDFS-4271) Problem in DFSInputStream read retry logic may cause early failure

2012-12-05 Thread Binglin Chang (JIRA)
Binglin Chang created HDFS-4271:
---

 Summary: Problem in DFSInputStream read retry logic may cause 
early failure
 Key: HDFS-4271
 URL: https://issues.apache.org/jira/browse/HDFS-4271
 Project: Hadoop HDFS
  Issue Type: Bug
Reporter: Binglin Chang
Assignee: Binglin Chang
Priority: Minor


Assume the following call logic
{noformat} 
readWithStrategy()
  -> blockSeekTo()
  -> readBuffer()
     -> reader.doRead()
     -> seekToNewSource()   (adds currentNode to deadNodes, hoping for a different datanode)
        -> blockSeekTo()
           -> chooseDataNode()
              -> block missing, clear deadNodes and pick the currentNode again
     seekToNewSource() returns false
  readBuffer() re-throws the exception, quitting the loop
readWithStrategy() gets the exception, and may fail the read call before
MaxBlockAcquireFailures attempts have been made.
{noformat} 
Some issues with this logic:
1. The seekToNewSource() logic is broken because it may clear deadNodes in
the middle (see the sketch below).
2. The variable "int retries=2" in readWithStrategy seems to conflict with
MaxBlockAcquireFailures; should it be removed?
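
A simplified, self-contained model of issue 1 (the set handling is an
illustrative assumption; the real DFSInputStream bookkeeping is more involved):

{code}
import java.util.HashSet;
import java.util.Set;

// Simplified model: once every known replica is marked dead,
// chooseDataNode() clears deadNodes and retries, so the client can be
// handed back the very node it just blacklisted.
public class DeadNodesDemo {
  public static void main(String[] args) {
    Set<String> deadNodes = new HashSet<String>();
    String currentNode = "dn1";     // the only node holding the block
    deadNodes.add(currentNode);     // seekToNewSource(): blacklist it

    // chooseDataNode(): no live node left, so clear deadNodes and retry
    deadNodes.clear();              // the problematic reset
    String chosen = currentNode;    // the same node is picked again

    System.out.println("re-chose blacklisted node: "
        + chosen.equals(currentNode));   // prints true
  }
}
{code}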

I wrote a test to reproduce the scenario; here is part of the log:

{noformat} 
2012-12-05 22:55:15,135 WARN  hdfs.DFSClient 
(DFSInputStream.java:readBuffer(596)) - Found Checksum error for 
BP-50712310-192.168.0.101-1354719313473:blk_-705068286766485620_1002 from 
127.0.0.1:50099 at 0
2012-12-05 22:55:15,136 INFO  DataNode.clienttrace 
(BlockSender.java:sendBlock(672)) - src: /127.0.0.1:50099, dest: 
/127.0.0.1:50105, bytes: 4128, op: HDFS_READ, cliID: 
DFSClient_NONMAPREDUCE_-1488457569_1, offset: 0, srvID: 
DS-91625336-192.168.0.101-50099-1354719314603, blockid: 
BP-50712310-192.168.0.101-1354719313473:blk_-705068286766485620_1002, duration: 
2925000
2012-12-05 22:55:15,136 INFO  hdfs.DFSClient 
(DFSInputStream.java:chooseDataNode(741)) - Could not obtain 
BP-50712310-192.168.0.101-1354719313473:blk_-705068286766485620_1002 from any 
node: java.io.IOException: No live nodes contain current block. Will get new 
block locations from namenode and retry...
2012-12-05 22:55:15,136 WARN  hdfs.DFSClient 
(DFSInputStream.java:chooseDataNode(756)) - DFS chooseDataNode: got # 1 
IOException, will wait for 274.34891931868265 msec.
2012-12-05 22:55:15,413 INFO  DataNode.clienttrace 
(BlockSender.java:sendBlock(672)) - src: /127.0.0.1:50099, dest: 
/127.0.0.1:50106, bytes: 4128, op: HDFS_READ, cliID: 
DFSClient_NONMAPREDUCE_-1488457569_1, offset: 0, srvID: 
DS-91625336-192.168.0.101-50099-1354719314603, blockid: 
BP-50712310-192.168.0.101-1354719313473:blk_-705068286766485620_1002, duration: 
283000
2012-12-05 22:55:15,414 INFO  hdfs.StateChange 
(FSNamesystem.java:reportBadBlocks(4761)) - *DIR* reportBadBlocks
2012-12-05 22:55:15,415 INFO  BlockStateChange 
(CorruptReplicasMap.java:addToCorruptReplicasMap(66)) - BLOCK 
NameSystem.addToCorruptReplicasMap: blk_-705068286766485620 added as corrupt on 
127.0.0.1:50099 by null because client machine reported it
2012-12-05 22:55:15,416 INFO  hdfs.TestClientReportBadBlock 
(TestDFSInputStream.java:testDFSInputStreamReadRetryTime(94)) - catch 
IOExceptionorg.apache.hadoop.fs.ChecksumException: Checksum error: /testFile at 
0 exp: 809972010 got: -1374622118
2012-12-05 22:55:15,431 INFO  hdfs.MiniDFSCluster 
(MiniDFSCluster.java:shutdown(1411)) - Shutting down the Mini HDFS Cluster
{noformat} 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Created] (HDFS-4272) Problem in DFSInputStream read retry logic may cause early failure

2012-12-05 Thread Binglin Chang (JIRA)
Binglin Chang created HDFS-4272:
---

 Summary: Problem in DFSInputStream read retry logic may cause 
early failure
 Key: HDFS-4272
 URL: https://issues.apache.org/jira/browse/HDFS-4272
 Project: Hadoop HDFS
  Issue Type: Bug
Reporter: Binglin Chang
Assignee: Binglin Chang
Priority: Minor


Assume the following call logic
{noformat} 
readWithStrategy()
  -> blockSeekTo()
  -> readBuffer()
     -> reader.doRead()
     -> seekToNewSource()   (adds currentNode to deadNodes, hoping for a different datanode)
        -> blockSeekTo()
           -> chooseDataNode()
              -> block missing, clear deadNodes and pick the currentNode again
     seekToNewSource() returns false
  readBuffer() re-throws the exception, quitting the loop
readWithStrategy() gets the exception, and may fail the read call before
MaxBlockAcquireFailures attempts have been made.
{noformat} 
Some issues with this logic:
1. The seekToNewSource() logic is broken because it may clear deadNodes in
the middle.
2. The variable "int retries=2" in readWithStrategy seems to conflict with
MaxBlockAcquireFailures; should it be removed?

I wrote a test to reproduce the scenario; here is part of the log:

{noformat} 
2012-12-05 22:55:15,135 WARN  hdfs.DFSClient 
(DFSInputStream.java:readBuffer(596)) - Found Checksum error for 
BP-50712310-192.168.0.101-1354719313473:blk_-705068286766485620_1002 from 
127.0.0.1:50099 at 0
2012-12-05 22:55:15,136 INFO  DataNode.clienttrace 
(BlockSender.java:sendBlock(672)) - src: /127.0.0.1:50099, dest: 
/127.0.0.1:50105, bytes: 4128, op: HDFS_READ, cliID: 
DFSClient_NONMAPREDUCE_-1488457569_1, offset: 0, srvID: 
DS-91625336-192.168.0.101-50099-1354719314603, blockid: 
BP-50712310-192.168.0.101-1354719313473:blk_-705068286766485620_1002, duration: 
2925000
2012-12-05 22:55:15,136 INFO  hdfs.DFSClient 
(DFSInputStream.java:chooseDataNode(741)) - Could not obtain 
BP-50712310-192.168.0.101-1354719313473:blk_-705068286766485620_1002 from any 
node: java.io.IOException: No live nodes contain current block. Will get new 
block locations from namenode and retry...
2012-12-05 22:55:15,136 WARN  hdfs.DFSClient 
(DFSInputStream.java:chooseDataNode(756)) - DFS chooseDataNode: got # 1 
IOException, will wait for 274.34891931868265 msec.
2012-12-05 22:55:15,413 INFO  DataNode.clienttrace 
(BlockSender.java:sendBlock(672)) - src: /127.0.0.1:50099, dest: 
/127.0.0.1:50106, bytes: 4128, op: HDFS_READ, cliID: 
DFSClient_NONMAPREDUCE_-1488457569_1, offset: 0, srvID: 
DS-91625336-192.168.0.101-50099-1354719314603, blockid: 
BP-50712310-192.168.0.101-1354719313473:blk_-705068286766485620_1002, duration: 
283000
2012-12-05 22:55:15,414 INFO  hdfs.StateChange 
(FSNamesystem.java:reportBadBlocks(4761)) - *DIR* reportBadBlocks
2012-12-05 22:55:15,415 INFO  BlockStateChange 
(CorruptReplicasMap.java:addToCorruptReplicasMap(66)) - BLOCK 
NameSystem.addToCorruptReplicasMap: blk_-705068286766485620 added as corrupt on 
127.0.0.1:50099 by null because client machine reported it
2012-12-05 22:55:15,416 INFO  hdfs.TestClientReportBadBlock 
(TestDFSInputStream.java:testDFSInputStreamReadRetryTime(94)) - catch 
IOExceptionorg.apache.hadoop.fs.ChecksumException: Checksum error: /testFile at 
0 exp: 809972010 got: -1374622118
2012-12-05 22:55:15,431 INFO  hdfs.MiniDFSCluster 
(MiniDFSCluster.java:shutdown(1411)) - Shutting down the Mini HDFS Cluster
{noformat} 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Created] (HDFS-4273) Problem in DFSInputStream read retry logic may cause early failure

2012-12-05 Thread Binglin Chang (JIRA)
Binglin Chang created HDFS-4273:
---

 Summary: Problem in DFSInputStream read retry logic may cause 
early failure
 Key: HDFS-4273
 URL: https://issues.apache.org/jira/browse/HDFS-4273
 Project: Hadoop HDFS
  Issue Type: Bug
Reporter: Binglin Chang
Assignee: Binglin Chang
Priority: Minor


Assume the following call logic
{noformat} 
readWithStrategy()
  -> blockSeekTo()
  -> readBuffer()
     -> reader.doRead()
     -> seekToNewSource()   (adds currentNode to deadNodes, hoping for a different datanode)
        -> blockSeekTo()
           -> chooseDataNode()
              -> block missing, clear deadNodes and pick the currentNode again
     seekToNewSource() returns false
  readBuffer() re-throws the exception, quitting the loop
readWithStrategy() gets the exception, and may fail the read call before
MaxBlockAcquireFailures attempts have been made.
{noformat} 
Some issues with this logic:
1. The seekToNewSource() logic is broken because it may clear deadNodes in
the middle.
2. The variable "int retries=2" in readWithStrategy seems to conflict with
MaxBlockAcquireFailures; should it be removed?


--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Resolved] (HDFS-4272) Problem in DFSInputStream read retry logic may cause early failure

2012-12-05 Thread Suresh Srinivas (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-4272?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Suresh Srinivas resolved HDFS-4272.
---

Resolution: Duplicate

Seems like a duplicate of HDFS-4271.

> Problem in DFSInputStream read retry logic may cause early failure
> --
>
> Key: HDFS-4272
> URL: https://issues.apache.org/jira/browse/HDFS-4272
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Binglin Chang
>Assignee: Binglin Chang
>Priority: Minor
>
> Assume the following call logic
> {noformat} 
> readWithStrategy()
>   -> blockSeekTo()
>   -> readBuffer()
>      -> reader.doRead()
>      -> seekToNewSource()   (adds currentNode to deadNodes, hoping for a different datanode)
>         -> blockSeekTo()
>            -> chooseDataNode()
>               -> block missing, clear deadNodes and pick the currentNode again
>      seekToNewSource() returns false
>   readBuffer() re-throws the exception, quitting the loop
> readWithStrategy() gets the exception, and may fail the read call before
> MaxBlockAcquireFailures attempts have been made.
> {noformat} 
> Some issues with this logic:
> 1. The seekToNewSource() logic is broken because it may clear deadNodes in
> the middle.
> 2. The variable "int retries=2" in readWithStrategy seems to conflict with
> MaxBlockAcquireFailures; should it be removed?
> I wrote a test to reproduce the scenario; here is part of the log:
> {noformat} 
> 2012-12-05 22:55:15,135 WARN  hdfs.DFSClient 
> (DFSInputStream.java:readBuffer(596)) - Found Checksum error for 
> BP-50712310-192.168.0.101-1354719313473:blk_-705068286766485620_1002 from 
> 127.0.0.1:50099 at 0
> 2012-12-05 22:55:15,136 INFO  DataNode.clienttrace 
> (BlockSender.java:sendBlock(672)) - src: /127.0.0.1:50099, dest: 
> /127.0.0.1:50105, bytes: 4128, op: HDFS_READ, cliID: 
> DFSClient_NONMAPREDUCE_-1488457569_1, offset: 0, srvID: 
> DS-91625336-192.168.0.101-50099-1354719314603, blockid: 
> BP-50712310-192.168.0.101-1354719313473:blk_-705068286766485620_1002, 
> duration: 2925000
> 2012-12-05 22:55:15,136 INFO  hdfs.DFSClient 
> (DFSInputStream.java:chooseDataNode(741)) - Could not obtain 
> BP-50712310-192.168.0.101-1354719313473:blk_-705068286766485620_1002 from any 
> node: java.io.IOException: No live nodes contain current block. Will get new 
> block locations from namenode and retry...
> 2012-12-05 22:55:15,136 WARN  hdfs.DFSClient 
> (DFSInputStream.java:chooseDataNode(756)) - DFS chooseDataNode: got # 1 
> IOException, will wait for 274.34891931868265 msec.
> 2012-12-05 22:55:15,413 INFO  DataNode.clienttrace 
> (BlockSender.java:sendBlock(672)) - src: /127.0.0.1:50099, dest: 
> /127.0.0.1:50106, bytes: 4128, op: HDFS_READ, cliID: 
> DFSClient_NONMAPREDUCE_-1488457569_1, offset: 0, srvID: 
> DS-91625336-192.168.0.101-50099-1354719314603, blockid: 
> BP-50712310-192.168.0.101-1354719313473:blk_-705068286766485620_1002, 
> duration: 283000
> 2012-12-05 22:55:15,414 INFO  hdfs.StateChange 
> (FSNamesystem.java:reportBadBlocks(4761)) - *DIR* reportBadBlocks
> 2012-12-05 22:55:15,415 INFO  BlockStateChange 
> (CorruptReplicasMap.java:addToCorruptReplicasMap(66)) - BLOCK 
> NameSystem.addToCorruptReplicasMap: blk_-705068286766485620 added as corrupt 
> on 127.0.0.1:50099 by null because client machine reported it
> 2012-12-05 22:55:15,416 INFO  hdfs.TestClientReportBadBlock 
> (TestDFSInputStream.java:testDFSInputStreamReadRetryTime(94)) - catch 
> IOExceptionorg.apache.hadoop.fs.ChecksumException: Checksum error: /testFile 
> at 0 exp: 809972010 got: -1374622118
> 2012-12-05 22:55:15,431 INFO  hdfs.MiniDFSCluster 
> (MiniDFSCluster.java:shutdown(1411)) - Shutting down the Mini HDFS Cluster
> {noformat} 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Created] (HDFS-4274) BlockPoolSliceScanner does not close verification log during shutdown

2012-12-05 Thread Chris Nauroth (JIRA)
Chris Nauroth created HDFS-4274:
---

 Summary: BlockPoolSliceScanner does not close verification log 
during shutdown
 Key: HDFS-4274
 URL: https://issues.apache.org/jira/browse/HDFS-4274
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: datanode
Affects Versions: trunk-win
Reporter: Chris Nauroth
Assignee: Chris Nauroth


{{BlockPoolSliceScanner}} holds open a handle to a verification log.  This file 
is not getting closed during process shutdown.
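
A minimal sketch of the missing cleanup, assuming a Closeable log field and
an idempotent shutdown method (names are illustrative, not the real
BlockPoolSliceScanner API):

{code}
import java.io.Closeable;
import java.io.IOException;

// Hypothetical sketch: close the verification log exactly once when the
// scanner shuts down, so the file handle does not outlive the scanner.
class VerificationLogHolder {
  private Closeable verificationLog;

  synchronized void shutdown() {
    if (verificationLog != null) {
      try {
        verificationLog.close();   // release the file handle
      } catch (IOException e) {
        // best effort during shutdown; nothing useful to do here
      }
      verificationLog = null;      // make repeated shutdown() calls safe
    }
  }
}
{code}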

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Created] (HDFS-4275) MiniDFSCluster-based tests fail on Windows due to failure to delete test name node directory

2012-12-05 Thread Chris Nauroth (JIRA)
Chris Nauroth created HDFS-4275:
---

 Summary: MiniDFSCluster-based tests fail on Windows due to failure 
to delete test name node directory
 Key: HDFS-4275
 URL: https://issues.apache.org/jira/browse/HDFS-4275
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: namenode
Affects Versions: trunk-win
Reporter: Chris Nauroth
Assignee: Chris Nauroth


Multiple HDFS test suites fail on Windows during initialization of 
{{MiniDFSCluster}} due to a "Could not fully delete" error on the NameNode 
test data directory.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Created] (HDFS-4276) HDFS tests have multiple failures on Windows due to file locking conflict

2012-12-05 Thread Chris Nauroth (JIRA)
Chris Nauroth created HDFS-4276:
---

 Summary: HDFS tests have multiple failures on Windows due to file 
locking conflict
 Key: HDFS-4276
 URL: https://issues.apache.org/jira/browse/HDFS-4276
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: namenode
Affects Versions: trunk-win
Reporter: Chris Nauroth
Assignee: Chris Nauroth


Multiple HDFS tests fail on Windows due to "The process cannot access the file 
because another process has locked a portion of the file".

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Resolved] (HDFS-3049) During the normal loading NN startup process, fall back on a different EditLog if we see one that is corrupt

2012-12-05 Thread Todd Lipcon (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-3049?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Lipcon resolved HDFS-3049.
---

   Resolution: Fixed
Fix Version/s: 2.0.3-alpha

Fixed the extra imports and committed to branch-2, thanks for the reviews.

> During the normal loading NN startup process, fall back on a different 
> EditLog if we see one that is corrupt
> 
>
> Key: HDFS-3049
> URL: https://issues.apache.org/jira/browse/HDFS-3049
> Project: Hadoop HDFS
>  Issue Type: New Feature
>  Components: namenode
>Affects Versions: 0.23.0
>Reporter: Colin Patrick McCabe
>Assignee: Colin Patrick McCabe
>Priority: Minor
> Fix For: 3.0.0, 2.0.3-alpha
>
> Attachments: HDFS-3049.001.patch, HDFS-3049.002.patch, 
> HDFS-3049.003.patch, HDFS-3049.005.against3335.patch, 
> HDFS-3049.006.against3335.patch, HDFS-3049.007.against3335.patch, 
> HDFS-3049.010.patch, HDFS-3049.011.patch, HDFS-3049.012.patch, 
> HDFS-3049.013.patch, HDFS-3049.015.patch, HDFS-3049.017.patch, 
> HDFS-3049.018.patch, HDFS-3049.021.patch, HDFS-3049.023.patch, 
> HDFS-3049.025.patch, HDFS-3049.026.patch, HDFS-3049.027.patch, 
> HDFS-3049.028.patch, HDFS-3049.028.patch, HDFS-3049.028.patch, 
> hdfs-3049-branch-2.txt
>
>
> During the NameNode startup process, we load an image, and then apply edit 
> logs to it until we believe that we have all the latest changes.  
> Unfortunately, if there is an I/O error while reading any of these files, in 
> most cases, we simply abort the startup process.  We should try harder to 
> locate a readable edit log and/or image file.
> *There are three main use cases for this feature:*
> 1. If the operating system does not honor fsync (usually due to a 
> misconfiguration), a file may end up in an inconsistent state.
> 2. In certain older releases where we did not use fallocate() or similar to 
> pre-reserve blocks, a disk full condition may cause a truncated log in one 
> edit directory.
> 3. There may be a bug in HDFS which results in some of the data directories 
> receiving corrupt data, but not all.  This is the least likely use case.
> *Proposed changes to normal NN startup*
> * We should try a different FSImage if we can't load the first one we try.
> * We should examine other FSEditLogs if we can't load the first one(s) we try.
> * We should fail if we can't find EditLogs that would bring us up to what we 
> believe is the latest transaction ID.
> *Proposed changes to recovery mode NN startup*
> We should list out all the available storage directories and allow the 
> operator to select which one he wants to use.
> Something like this:
> {code}
> Multiple storage directories found.
> 1. /foo/bar
> edits__current__XYZ  size:213421345   md5:2345345
> image  size:213421345   md5:2345345
> 2. /foo/baz
> edits__current__XYZ  size:213421345   md5:2345345345
> image  size:213421345   md5:2345345
> Which one would you like to use? (1/2)
> {code}
> As usual in recovery mode, we want to be flexible about error handling.  In 
> this case, this means that we should NOT fail if we can't find EditLogs that 
> would bring us up to what we believe is the latest transaction ID.
> *Not addressed by this feature*
> This feature will not address the case where an attempt to access the 
> NameNode name directory or directories hangs because of an I/O error.  This 
> may happen, for example, when trying to load an image from a hard-mounted NFS 
> directory, when the NFS server has gone away.  Just as now, the operator will 
> have to notice this problem and take steps to correct it.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Resolved] (HDFS-3571) Allow EditLogFileInputStream to read from a remote URL

2012-12-05 Thread Todd Lipcon (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-3571?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Lipcon resolved HDFS-3571.
---

   Resolution: Fixed
Fix Version/s: 2.0.3-alpha

Committed backport to branch-2. Thanks for reviewing.

> Allow EditLogFileInputStream to read from a remote URL
> --
>
> Key: HDFS-3571
> URL: https://issues.apache.org/jira/browse/HDFS-3571
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: ha, namenode
>Affects Versions: 3.0.0
>Reporter: Todd Lipcon
>Assignee: Todd Lipcon
> Fix For: 3.0.0, 2.0.3-alpha
>
> Attachments: hdfs-3571-branch-2.txt, hdfs-3571.txt, hdfs-3571.txt
>
>
> In order to start up from remote edits storage (like the JournalNodes of 
> HDFS-3077), the NN needs to be able to load edits from a URL, instead of just 
> local disk. This JIRA extends EditLogFileInputStream to be able to use a URL 
> reference in addition to the current File reference.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Resolved] (HDFS-3077) Quorum-based protocol for reading and writing edit logs

2012-12-05 Thread Todd Lipcon (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-3077?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Lipcon resolved HDFS-3077.
---

   Resolution: Fixed
Fix Version/s: 2.0.3-alpha

Committed backport to branch-2. Thanks for looking at the backport patch, 
Andrew and Aaron.

> Quorum-based protocol for reading and writing edit logs
> ---
>
> Key: HDFS-3077
> URL: https://issues.apache.org/jira/browse/HDFS-3077
> Project: Hadoop HDFS
>  Issue Type: New Feature
>  Components: ha, namenode
>Reporter: Todd Lipcon
>Assignee: Todd Lipcon
> Fix For: 3.0.0, QuorumJournalManager (HDFS-3077), 2.0.3-alpha
>
> Attachments: hdfs-3077-branch-2.txt, hdfs-3077-partial.txt, 
> hdfs-3077-test-merge.txt, hdfs-3077.txt, hdfs-3077.txt, hdfs-3077.txt, 
> hdfs-3077.txt, hdfs-3077.txt, hdfs-3077.txt, hdfs-3077.txt, 
> qjournal-design.pdf, qjournal-design.pdf, qjournal-design.pdf, 
> qjournal-design.pdf, qjournal-design.pdf, qjournal-design.pdf, 
> qjournal-design.tex, qjournal-design.tex
>
>
> Currently, one of the weak points of the HA design is that it relies on 
> shared storage such as an NFS filer for the shared edit log. One alternative 
> that has been proposed is to depend on BookKeeper, a ZooKeeper subproject 
> which provides a highly available replicated edit log on commodity hardware. 
> This JIRA is to implement another alternative, based on a quorum commit 
> protocol, integrated more tightly in HDFS and with the requirements driven 
> only by HDFS's needs rather than more generic use cases. More details to 
> follow.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Reopened] (HDFS-4110) Refine JNStorage log

2012-12-05 Thread Todd Lipcon (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-4110?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Lipcon reopened HDFS-4110:
---


Reopening to backport to branch-2

> Refine JNStorage log
> 
>
> Key: HDFS-4110
> URL: https://issues.apache.org/jira/browse/HDFS-4110
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: journal-node
>Affects Versions: 3.0.0, 2.0.3-alpha
>Reporter: liang xie
>Assignee: liang xie
>Priority: Trivial
>  Labels: newbie
> Fix For: 3.0.0
>
> Attachments: HDFS-4110.txt
>
>
> Abstract class Storage has a toString method: 
> {quote}
> return "Storage Directory " + this.root;
> {quote}
> and in the subclass JNStorage we could see:
> {quote}
> LOG.info("Formatting journal storage directory " + 
> sd + " with nsid: " + getNamespaceID());
> {quote}
> that'll print something like "Formatting journal storage directory Storage 
> Directory x"
> Just a one-line change to:
> {quote}
> LOG.info("Formatting journal " + sd + " with nsid: " + getNamespaceID());
> {quote}

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Resolved] (HDFS-4110) Refine JNStorage log

2012-12-05 Thread Todd Lipcon (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-4110?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Lipcon resolved HDFS-4110.
---

   Resolution: Fixed
Fix Version/s: 2.0.3-alpha

Committed backport to branch-2 (same patch applied)

> Refine JNStorage log
> 
>
> Key: HDFS-4110
> URL: https://issues.apache.org/jira/browse/HDFS-4110
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: journal-node
>Affects Versions: 3.0.0, 2.0.3-alpha
>Reporter: liang xie
>Assignee: liang xie
>Priority: Trivial
>  Labels: newbie
> Fix For: 3.0.0, 2.0.3-alpha
>
> Attachments: HDFS-4110.txt
>
>
> Abstract class Storage has a toString method: 
> {quote}
> return "Storage Directory " + this.root;
> {quote}
> and in the subclass JNStorage we could see:
> {quote}
> LOG.info("Formatting journal storage directory " + 
> sd + " with nsid: " + getNamespaceID());
> {quote}
> that'll print something like "Formatting journal storage directory Storage 
> Directory x"
> Just a one-line change to:
> {quote}
> LOG.info("Formatting journal " + sd + " with nsid: " + getNamespaceID());
> {quote}

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Created] (HDFS-4277) SocketTimeoutExceptions over the DataXciever service of a DN should print the DFSClient ID

2012-12-05 Thread Harsh J (JIRA)
Harsh J created HDFS-4277:
-

 Summary: SocketTimeoutExceptions over the DataXciever service of a 
DN should print the DFSClient ID
 Key: HDFS-4277
 URL: https://issues.apache.org/jira/browse/HDFS-4277
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: datanode
Affects Versions: 2.0.0-alpha
Reporter: Harsh J
Priority: Minor


Currently, when one faces a SocketTimeoutException (or any exception, rather) 
in a DN log for a client <-> DN interaction, we fail to print the DFSClient 
ID. This makes it untraceable (e.g., is it a timeout caused by a speculative 
MR task, an RS crash, etc.).
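
A minimal sketch of the kind of message this asks for (the helper is an
illustrative assumption, not the real DataXceiver code):

{code}
// Hypothetical sketch: fold the DFSClient ID into the exception message so
// a timeout in the DN log can be traced back to a specific client.
static String clientErrorMessage(String op, String remoteAddress,
                                 String dfsClientId, Exception e) {
  return op + " error, client=" + dfsClientId
      + ", remote=" + remoteAddress + ": " + e;
}
{code}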

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Created] (HDFS-4278) The DFS_BLOCK_ACCESS_TOKEN_ENABLE config should be automatically turned on when security is enabled.

2012-12-05 Thread Harsh J (JIRA)
Harsh J created HDFS-4278:
-

 Summary: The DFS_BLOCK_ACCESS_TOKEN_ENABLE config should be 
automatically turned on when security is enabled.
 Key: HDFS-4278
 URL: https://issues.apache.org/jira/browse/HDFS-4278
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: datanode, namenode
Affects Versions: 2.0.0-alpha
Reporter: Harsh J


When enabling security, one has to manually enable the config 
DFS_BLOCK_ACCESS_TOKEN_ENABLE (dfs.block.access.token.enable). Since these two 
are coupled, we could make it turn itself on automatically if we find security 
to be enabled.
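
A minimal sketch of the coupling, using the standard Configuration API (the
surrounding helper is an illustrative assumption):

{code}
import org.apache.hadoop.conf.Configuration;

// Hypothetical sketch: default dfs.block.access.token.enable to "on"
// whenever Kerberos security is enabled, instead of a hard-coded false.
class BlockTokenDefaults {
  static boolean isBlockTokenEnabled(Configuration conf) {
    boolean securityEnabled = "kerberos".equals(
        conf.get("hadoop.security.authentication", "simple"));
    return conf.getBoolean("dfs.block.access.token.enable", securityEnabled);
  }
}
{code}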

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


Re: profiling hdfs write path

2012-12-05 Thread Andy Isaacson
On Tue, Dec 4, 2012 at 6:00 PM, Radim Kolar  wrote:
> It's not about writing patches, it's about getting them committed. In my
> experience, getting something committed takes months, even for a simple
> patch. I have about 10 patches floating around; none of them has been
> committed in the last 4 weeks.

Could you share a list of Jiras you're concerned about? I've seen a
few patches you provided that got committed just fine, and I've seen a
few patches that I thought didn't have a strong justification that
didn't get committed, and I think I've seen a few Jiras that I thought
were a good idea that haven't been committed yet due to outstanding
review feedback or lack of a committer who can volunteer to do the
work.

I'm not saying that the Hadoop process is perfect, far from it, but
from where I sit (like you, I'm a contributor but not yet a committer)
it seems to be working OK so far for both you and me. Some things could
be better, but the current fairly conservative process has the benefit
of keeping trunk in a really sane, safe state.

> They are really simple stuff. I haven't tried anything more elaborate,
> because the Bible says: if you fail at an easy thing, you will fail at a
> hard thing too.
>
> I am thinking day by day that I really need to fork Hadoop, otherwise there
> is no way to move it forward to where I need it to be.

Forking is tempting, but working with the community is really
powerful. You've got plenty of successful jiras under your belt; let's
just keep on truckin' and build a better Hadoop.

-andy


[jira] [Created] (HDFS-4279) NameNode does not initialize generic conf keys when started with -recover

2012-12-05 Thread Colin Patrick McCabe (JIRA)
Colin Patrick McCabe created HDFS-4279:
--

 Summary: NameNode does not initialize generic conf keys when 
started with -recover
 Key: HDFS-4279
 URL: https://issues.apache.org/jira/browse/HDFS-4279
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: namenode
Affects Versions: 2.0.3-alpha
Reporter: Colin Patrick McCabe
Assignee: Colin Patrick McCabe
Priority: Minor


This means that configurations that scope the location of the name/edits/shared 
edits dirs by nameservice or namenode won't work with `hdfs namenode -recover`.
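
A sketch of the assumed fix, based on helpers that existed in the NameNode
code of this era (treat the exact calls as assumptions, not the committed
patch):

{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hdfs.DFSUtil;
import org.apache.hadoop.hdfs.HAUtil;
import org.apache.hadoop.hdfs.server.namenode.NameNode;

// Hypothetical sketch: resolve the nameservice/namenode IDs and fold the
// scoped keys (name/edits/shared-edits dirs) into the generic ones before
// entering recovery, the same way normal startup does.
class RecoveryConfSetup {
  static Configuration prepare(Configuration conf) {
    String nsId = DFSUtil.getNamenodeNameServiceId(conf);  // assumed helper
    String nnId = HAUtil.getNameNodeId(conf, nsId);        // assumed helper
    NameNode.initializeGenericKeys(conf, nsId, nnId);      // assumed helper
    return conf;
  }
}
{code}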

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Created] (HDFS-4280) InodeTree.java has redundant check for vName while throwing exception

2012-12-05 Thread Arup Malakar (JIRA)
Arup Malakar created HDFS-4280:
--

 Summary: InodeTree.java has redundant check for vName while 
throwing exception
 Key: HDFS-4280
 URL: https://issues.apache.org/jira/browse/HDFS-4280
 Project: Hadoop HDFS
  Issue Type: Improvement
Reporter: Arup Malakar
Priority: Trivial


{code}
if (!gotMountTableEntry) {
  throw new IOException(
      "ViewFs: Cannot initialize: Empty Mount table in config for " +
      vName == null ? "viewfs:///" : ("viewfs://" + vName + "/"));
}
{code}

The vName is always non-null due to checks/assignments done prior to this code 
segment.
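
Incidentally, since string concatenation (+) binds tighter than ==, the null
check above compares the concatenated string rather than vName, so the
conditional is dead either way. A sketch of the simplification (assuming, per
the report, that vName is always non-null):

{code}
if (!gotMountTableEntry) {
  throw new IOException(
      "ViewFs: Cannot initialize: Empty Mount table in config for "
      + "viewfs://" + vName + "/");
}
{code}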

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Resolved] (HDFS-4271) Problem in DFSInputStream read retry logic may cause early failure

2012-12-05 Thread Binglin Chang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-4271?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Binglin Chang resolved HDFS-4271.
-

Resolution: Duplicate

I wasn't aware I had created 3 issues because of the bad internet connection.

> Problem in DFSInputStream read retry logic may cause early failure
> --
>
> Key: HDFS-4271
> URL: https://issues.apache.org/jira/browse/HDFS-4271
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Binglin Chang
>Assignee: Binglin Chang
>Priority: Minor
>
> Assume the following call logic
> {noformat} 
> readWithStrategy()
>   -> blockSeekTo()
>   -> readBuffer()
>      -> reader.doRead()
>      -> seekToNewSource()   (adds currentNode to deadNodes, hoping for a different datanode)
>         -> blockSeekTo()
>            -> chooseDataNode()
>               -> block missing, clear deadNodes and pick the currentNode again
>      seekToNewSource() returns false
>   readBuffer() re-throws the exception, quitting the loop
> readWithStrategy() gets the exception, and may fail the read call before
> MaxBlockAcquireFailures attempts have been made.
> {noformat} 
> Some issues with this logic:
> 1. The seekToNewSource() logic is broken because it may clear deadNodes in
> the middle.
> 2. The variable "int retries=2" in readWithStrategy seems to conflict with
> MaxBlockAcquireFailures; should it be removed?
> I wrote a test to reproduce the scenario; here is part of the log:
> {noformat} 
> 2012-12-05 22:55:15,135 WARN  hdfs.DFSClient 
> (DFSInputStream.java:readBuffer(596)) - Found Checksum error for 
> BP-50712310-192.168.0.101-1354719313473:blk_-705068286766485620_1002 from 
> 127.0.0.1:50099 at 0
> 2012-12-05 22:55:15,136 INFO  DataNode.clienttrace 
> (BlockSender.java:sendBlock(672)) - src: /127.0.0.1:50099, dest: 
> /127.0.0.1:50105, bytes: 4128, op: HDFS_READ, cliID: 
> DFSClient_NONMAPREDUCE_-1488457569_1, offset: 0, srvID: 
> DS-91625336-192.168.0.101-50099-1354719314603, blockid: 
> BP-50712310-192.168.0.101-1354719313473:blk_-705068286766485620_1002, 
> duration: 2925000
> 2012-12-05 22:55:15,136 INFO  hdfs.DFSClient 
> (DFSInputStream.java:chooseDataNode(741)) - Could not obtain 
> BP-50712310-192.168.0.101-1354719313473:blk_-705068286766485620_1002 from any 
> node: java.io.IOException: No live nodes contain current block. Will get new 
> block locations from namenode and retry...
> 2012-12-05 22:55:15,136 WARN  hdfs.DFSClient 
> (DFSInputStream.java:chooseDataNode(756)) - DFS chooseDataNode: got # 1 
> IOException, will wait for 274.34891931868265 msec.
> 2012-12-05 22:55:15,413 INFO  DataNode.clienttrace 
> (BlockSender.java:sendBlock(672)) - src: /127.0.0.1:50099, dest: 
> /127.0.0.1:50106, bytes: 4128, op: HDFS_READ, cliID: 
> DFSClient_NONMAPREDUCE_-1488457569_1, offset: 0, srvID: 
> DS-91625336-192.168.0.101-50099-1354719314603, blockid: 
> BP-50712310-192.168.0.101-1354719313473:blk_-705068286766485620_1002, 
> duration: 283000
> 2012-12-05 22:55:15,414 INFO  hdfs.StateChange 
> (FSNamesystem.java:reportBadBlocks(4761)) - *DIR* reportBadBlocks
> 2012-12-05 22:55:15,415 INFO  BlockStateChange 
> (CorruptReplicasMap.java:addToCorruptReplicasMap(66)) - BLOCK 
> NameSystem.addToCorruptReplicasMap: blk_-705068286766485620 added as corrupt 
> on 127.0.0.1:50099 by null because client machine reported it
> 2012-12-05 22:55:15,416 INFO  hdfs.TestClientReportBadBlock 
> (TestDFSInputStream.java:testDFSInputStreamReadRetryTime(94)) - catch 
> IOExceptionorg.apache.hadoop.fs.ChecksumException: Checksum error: /testFile 
> at 0 exp: 809972010 got: -1374622118
> 2012-12-05 22:55:15,431 INFO  hdfs.MiniDFSCluster 
> (MiniDFSCluster.java:shutdown(1411)) - Shutting down the Mini HDFS Cluster
> {noformat} 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Created] (HDFS-4281) NameNode recovery does not detect NN RPC address on HA cluster

2012-12-05 Thread Stephen Chu (JIRA)
Stephen Chu created HDFS-4281:
-

 Summary: NameNode recovery does not detect NN RPC address on HA 
cluster
 Key: HDFS-4281
 URL: https://issues.apache.org/jira/browse/HDFS-4281
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: namenode
Affects Versions: 2.0.0-alpha
Reporter: Stephen Chu
 Attachments: core-site.xml, hdfs-site.xml, nn_recover

On a shut down HA cluster, I ran "hdfs namenode -recover" and encountered:

{code}
bash-4.1$ hdfs namenode -recover
12/12/05 16:43:47 INFO namenode.NameNode: STARTUP_MSG:
/
STARTUP_MSG: Starting NameNode
STARTUP_MSG:   host = cs-10-20-193-228.cloud.cloudera.com/10.20.193.228
STARTUP_MSG:   args = [-recover]
STARTUP_MSG:   version = 2.0.0-cdh4.1.2
STARTUP_MSG:   classpath = 
/etc/hadoop/conf:/usr/lib/hadoop/lib/jackson-jaxrs-1.8.8.jar:/usr/lib/hadoop/lib/slf4j-api-1.6.1.jar:/usr/lib/hadoop/lib/commons-collections-3.2.1.jar:/usr/lib/hadoop/lib/commons-lang-2\
.5.jar:/usr/lib/hadoop/lib/jasper-runtime-5.5.23.jar:/usr/lib/hadoop/lib/jets3t-0.6.1.jar:/usr/lib/hadoop/lib/jsch-0.1.42.jar:/usr/lib/hadoop/lib/jetty-util-6.1.26.cloudera.2.jar:/usr/lib/hadoop/lib/jackson-mappe\
r-asl-1.8.8.jar:/usr/lib/hadoop/lib/commons-cli-1.2.jar:/usr/lib/hadoop/lib/commons-beanutils-core-1.8.0.jar:/usr/lib/hadoop/lib/zookeeper-3.4.3-cdh4.1.2.jar:/usr/lib/hadoop/lib/jsr305-1.3.9.jar:/usr/lib/hadoop/l\
ib/xmlenc-0.52.jar:/usr/lib/hadoop/lib/jetty-6.1.26.cloudera.2.jar:/usr/lib/hadoop/lib/log4j-1.2.17.jar:/usr/lib/hadoop/lib/protobuf-java-2.4.0a.jar:/usr/lib/hadoop/lib/jersey-json-1.8.jar:/usr/lib/hadoop/lib/jun\
it-4.8.2.jar:/usr/lib/hadoop/lib/jettison-1.1.jar:/usr/lib/hadoop/lib/commons-digester-1.8.jar:/usr/lib/hadoop/lib/guava-11.0.2.jar:/usr/lib/hadoop/lib/kfs-0.3.jar:/usr/lib/hadoop/lib/snappy-java-1.0.4.1.jar:/usr\
/lib/hadoop/lib/servlet-api-2.5.jar:/usr/lib/hadoop/lib/mockito-all-1.8.5.jar:/usr/lib/hadoop/lib/stax-api-1.0.1.jar:/usr/lib/hadoop/lib/slf4j-log4j12-1.6.1.jar:/usr/lib/hadoop/lib/commons-configuration-1.6.jar:/\
usr/lib/hadoop/lib/paranamer-2.3.jar:/usr/lib/hadoop/lib/jaxb-api-2.2.2.jar:/usr/lib/hadoop/lib/jaxb-impl-2.2.3-1.jar:/usr/lib/hadoop/lib/commons-math-2.1.jar:/usr/lib/hadoop/lib/activation-1.1.jar:/usr/lib/hadoo\
p/lib/commons-io-2.1.jar:/usr/lib/hadoop/lib/commons-beanutils-1.7.0.jar:/usr/lib/hadoop/lib/commons-net-3.1.jar:/usr/lib/hadoop/lib/asm-3.2.jar:/usr/lib/hadoop/lib/commons-codec-1.4.jar:/usr/lib/hadoop/lib/jaspe\
r-compiler-5.5.23.jar:/usr/lib/hadoop/lib/jackson-xc-1.8.8.jar:/usr/lib/hadoop/lib/commons-el-1.0.jar:/usr/lib/hadoop/lib/jackson-core-asl-1.8.8.jar:/usr/lib/hadoop/lib/jline-0.9.94.jar:/usr/lib/hadoop/lib/avro-1\
.7.1.cloudera.2.jar:/usr/lib/hadoop/lib/jersey-core-1.8.jar:/usr/lib/hadoop/lib/commons-httpclient-3.1.jar:/usr/lib/hadoop/lib/commons-logging-1.1.1.jar:/usr/lib/hadoop/lib/jersey-server-1.8.jar:/usr/lib/hadoop/l\
ib/jsp-api-2.1.jar:/usr/lib/hadoop/.//hadoop-auth-2.0.0-cdh4.1.2.jar:/usr/lib/hadoop/.//hadoop-annotations-2.0.0-cdh4.1.2.jar:/usr/lib/hadoop/.//hadoop-common.jar:/usr/lib/hadoop/.//hadoop-auth.jar:/usr/lib/hadoo\
p/.//hadoop-common-2.0.0-cdh4.1.2-tests.jar:/usr/lib/hadoop/.//hadoop-annotations.jar:/usr/lib/hadoop/.//hadoop-common-2.0.0-cdh4.1.2.jar:/usr/lib/hadoop-hdfs/./:/usr/lib/hadoop-hdfs/lib/commons-lang-2.5.jar:/usr\
/lib/hadoop-hdfs/lib/jasper-runtime-5.5.23.jar:/usr/lib/hadoop-hdfs/lib/jetty-util-6.1.26.cloudera.2.jar:/usr/lib/hadoop-hdfs/lib/jackson-mapper-asl-1.8.8.jar:/usr/lib/hadoop-hdfs/lib/commons-cli-1.2.jar:/usr/lib\
/hadoop-hdfs/lib/zookeeper-3.4.3-cdh4.1.2.jar:/usr/lib/hadoop-hdfs/lib/jsr305-1.3.9.jar:/usr/lib/hadoop-hdfs/lib/xmlenc-0.52.jar:/usr/lib/hadoop-hdfs/lib/jetty-6.1.26.cloudera.2.jar:/usr/lib/hadoop-hdfs/lib/commo\
ns-daemon-1.0.3.jar:/usr/lib/hadoop-hdfs/lib/log4j-1.2.17.jar:/usr/lib/hadoop-hdfs/lib/protobuf-java-2.4.0a.jar:/usr/lib/hadoop-hdfs/lib/guava-11.0.2.jar:/usr/lib/hadoop-hdfs/lib/servlet-api-2.5.jar:/usr/lib/hado\
op-hdfs/lib/commons-io-2.1.jar:/usr/lib/hadoop-hdfs/lib/asm-3.2.jar:/usr/lib/hadoop-hdfs/lib/commons-codec-1.4.jar:/usr/lib/hadoop-hdfs/lib/commons-el-1.0.jar:/usr/lib/hadoop-hdfs/lib/jackson-core-asl-1.8.8.jar:/\
usr/lib/hadoop-hdfs/lib/jline-0.9.94.jar:/usr/lib/hadoop-hdfs/lib/jersey-core-1.8.jar:/usr/lib/hadoop-hdfs/lib/commons-logging-1.1.1.jar:/usr/lib/hadoop-hdfs/lib/jersey-server-1.8.jar:/usr/lib/hadoop-hdfs/lib/jsp\
-api-2.1.jar:/usr/lib/hadoop-hdfs/.//hadoop-hdfs-2.0.0-cdh4.1.2.jar:/usr/lib/hadoop-hdfs/.//hadoop-hdfs-2.0.0-cdh4.1.2-tests.jar:/usr/lib/hadoop-hdfs/.//hadoop-hdfs.jar:/usr/lib/hadoop-yarn/.//*:/usr/lib/hadoop-0\
.20-mapreduce/.//*
STARTUP_MSG:   build = 
file:///data/1/jenkins/workspace/generic-package-rhel64-6-0/topdir/BUILD/hadoop-2.0.0-cdh4.1.2/src/hadoop-common-project/hadoop-common
 -r f0b53c81cbf56f5955e403b49fcd27afd5f082de; compiled \
by 'jenkins' on Thu Nov  

Re: profiling hdfs write path

2012-12-05 Thread Radim Kolar

YARN-223 
YARN-211 
YARN-210 
MAPREDUCE-4839 
MAPREDUCE-4827 
MAPREDUCE-4594 
MAPREDUCE-3968 
HADOOP-9088 
HADOOP-9041 
HADOOP-8698 


[jira] [Resolved] (HDFS-4281) NameNode recovery does not detect NN RPC address on HA cluster

2012-12-05 Thread Stephen Chu (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-4281?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stephen Chu resolved HDFS-4281.
---

Resolution: Duplicate

Yes, I believe it is. Marking as duplicate.

> NameNode recovery does not detect NN RPC address on HA cluster
> --
>
> Key: HDFS-4281
> URL: https://issues.apache.org/jira/browse/HDFS-4281
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 2.0.0-alpha
>Reporter: Stephen Chu
> Attachments: core-site.xml, hdfs-site.xml, nn_recover
>
>
> On a shut down HA cluster, I ran "hdfs namenode -recover" and encountered:
> {code}
> You have selected Metadata Recovery mode.  This mode is intended to recover 
> lost metadata on a corrupt filesystem.  Metadata recovery mode often 
> permanently deletes data from your HDFS filesystem.  Please back up\
>  your edit log and fsimage before trying this!
> Are you ready to proceed? (Y/N)
>  (Y or N) Y
> 12/12/05 16:43:48 INFO namenode.MetaRecoveryContext: starting recovery...
> 12/12/05 16:43:48 WARN common.Util: Path /dfs/nn should be specified as a URI 
> in configuration files. Please update hdfs configuration.
> 12/12/05 16:43:48 WARN common.Util: Path /dfs/nn should be specified as a URI 
> in configuration files. Please update hdfs configuration.
> 12/12/05 16:43:48 WARN namenode.FSNamesystem: Only one image storage 
> directory (dfs.namenode.name.dir) configured. Beware of dataloss due to lack 
> of redundant storage directories!
> 12/12/05 16:43:48 INFO util.HostsFileReader: Refreshing hosts 
> (include/exclude) list
> 12/12/05 16:43:48 INFO blockmanagement.DatanodeManager: 
> dfs.block.invalidate.limit=1000
> 12/12/05 16:43:48 INFO blockmanagement.BlockManager: 
> dfs.block.access.token.enable=true
> 12/12/05 16:43:48 INFO blockmanagement.BlockManager: 
> dfs.block.access.key.update.interval=600 min(s), 
> dfs.block.access.token.lifetime=600 min(s), 
> dfs.encrypt.data.transfer.algorithm=null
> 12/12/05 16:43:48 INFO namenode.MetaRecoveryContext: RECOVERY FAILED: caught 
> exception
> java.lang.IllegalStateException: Could not determine own NN ID in namespace 
> 'ha-nn-uri'. Please ensure that this node is one of the machines listed as an 
> NN RPC address, or configure dfs.ha.namenode.id
> at 
> com.google.common.base.Preconditions.checkState(Preconditions.java:172)
> at 
> org.apache.hadoop.hdfs.HAUtil.getNameNodeIdOfOtherNode(HAUtil.java:155)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.createBlockTokenSecretManager(BlockManager.java:323)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.(BlockManager.java:239)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.(FSNamesystem.java:451)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFromDisk(FSNamesystem.java:416)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFromDisk(FSNamesystem.java:386)
> at 
> org.apache.hadoop.hdfs.server.namenode.NameNode.doRecovery(NameNode.java:1063)
> at 
> org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1135)
> at 
> org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1204)
> 12/12/05 16:43:48 FATAL namenode.NameNode: Exception in namenode join
> java.lang.IllegalStateException: Could not determine own NN ID in namespace 
> 'ha-nn-uri'. Please ensure that this node is one of the machines listed as an 
> NN RPC address, or configure dfs.ha.namenode.id
> at 
> com.google.common.base.Preconditions.checkState(Preconditions.java:172)
> at 
> org.apache.hadoop.hdfs.HAUtil.getNameNodeIdOfOtherNode(HAUtil.java:155)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.createBlockTokenSecretManager(BlockManager.java:323)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.(BlockManager.java:239)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.(FSNamesystem.java:451)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFromDisk(FSNamesystem.java:416)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFromDisk(FSNamesystem.java:386)
> at 
> org.apache.hadoop.hdfs.server.namenode.NameNode.doRecovery(NameNode.java:1063)
> at 
> org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1135)
> at 
> org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1204)
> 12/12/05 16:43:48 INFO util.ExitUtil: Exiting with status 1
> 12/12/05 16:43:48 INFO namenode.NameNode: SHUTDOWN_MSG:
> /
> SHUTDOWN_MSG: Shutting down NameNode at 
> cs-10-20-193-228.cloud.cloudera.com/10.20

[jira] [Created] (HDFS-4282) TestEditLog.testFuzzSequences FAILED in all pre-commit test

2012-12-05 Thread Junping Du (JIRA)
Junping Du created HDFS-4282:


 Summary: TestEditLog.testFuzzSequences FAILED in all pre-commit 
test
 Key: HDFS-4282
 URL: https://issues.apache.org/jira/browse/HDFS-4282
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: test
Affects Versions: 3.0.0
Reporter: Junping Du
 Fix For: 3.0.0


{noformat}
Caught non-IOException throwable java.lang.RuntimeException: java.io.IOException: Invalid UTF8 at 9871b370d70a
  at org.apache.hadoop.io.UTF8.toString(UTF8.java:154)
  at org.apache.hadoop.hdfs.server.namenode.FSImageSerialization.readString(FSImageSerialization.java:200)
  at org.apache.hadoop.hdfs.server.namenode.FSEditLogOp$TimesOp.readFields(FSEditLogOp.java:1439)
  at org.apache.hadoop.hdfs.server.namenode.FSEditLogOp$Reader.decodeOp(FSEditLogOp.java:2399)
  at org.apache.hadoop.hdfs.server.namenode.FSEditLogOp$Reader.readOp(FSEditLogOp.java:2290)
  at org.apache.hadoop.hdfs.server.namenode.EditLogFileInputStream.nextOpImpl(EditLogFileInputStream.java:177)
  at org.apache.hadoop.hdfs.server.namenode.EditLogFileInputStream.nextOpImpl(EditLogFileInputStream.java:175)
  at org.apache.hadoop.hdfs.server.namenode.EditLogFileInputStream.nextOp(EditLogFileInputStream.java:217)
  at org.apache.hadoop.hdfs.server.namenode.EditLogInputStream.readOp(EditLogInputStream.java:72)
  at org.apache.hadoop.hdfs.server.namenode.TestEditLog.validateNoCrash(TestEditLog.java:1233)
  at org.apache.hadoop.hdfs.server.namenode.TestEditLog.testFuzzSequences(TestEditLog.java:1272)
  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
  at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
  at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
  at java.lang.reflect.Method.invoke(Method.java:597)
  at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:44)
  at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:15)
  at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:41)
  at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:20)
  at org.junit.runners.BlockJUnit4ClassRunner.runNotIgnored(BlockJUnit4ClassRunner.java:79)
  at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:71)
  at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:49)
  at org.junit.runners.ParentRunner$3.run(ParentRunner.java:193)
  at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:52)
  at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:191)
  at org.junit.runners.ParentRunner.access$000(ParentRunner.java:42)
  at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:184)
  at org.junit.runners.ParentRunner.run(ParentRunner.java:236)
  at org.apache.maven.surefire.junit4.JUnit4Provider.execute(JUnit4Provider.java:252)
  at org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:141)
  at org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:112)
  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
  at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
  at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
  at java.lang.reflect.Method.invoke(Method.java:597)
  at org.apache.maven.surefire.util.ReflectionUtils.invokeMethodWithArray(ReflectionUtils.java:189)
  at org.apache.maven.surefire.booter.ProviderFactory$ProviderProxy.invoke(ProviderFactory.java:165)
  at org.apache.maven.surefire.booter.ProviderFactory.invokeProvider(ProviderFactory.java:85)
  at org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:115)
  at org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:75)
Caused by: java.io.IOException: Invalid UTF8 at 9871b370d70a
  at org.apache.hadoop.io.UTF8.readChars(UTF8.java:277)
  at org.apache.hadoop.io.UTF8.toString(UTF8.java:151)
  ... 39 more
{noformat}

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Created] (HDFS-4283) Spurious Test failures in TestEditLog.testFuzzSequences

2012-12-05 Thread Colin Patrick McCabe (JIRA)
Colin Patrick McCabe created HDFS-4283:
--

 Summary: Spurious Test failures in TestEditLog.testFuzzSequences
 Key: HDFS-4283
 URL: https://issues.apache.org/jira/browse/HDFS-4283
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: test
Affects Versions: 2.0.3-alpha
Reporter: Colin Patrick McCabe
Assignee: Colin Patrick McCabe
Priority: Minor


testFuzzSequences fails sometimes due to the additional UTF-8 validation added 
in HADOOP-9103.  The issue is that the UTF8 class throws its exceptions as 
{{RuntimeExceptions}} rather than {{IOExceptions}}, and the fuzzing code is not 
expecting that.
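
A self-contained sketch of one possible accommodation in the test harness
(an assumption about the direction of the fix, not the committed patch):

{code}
import java.io.IOException;
import java.util.concurrent.Callable;

// Hypothetical sketch: treat a RuntimeException whose cause is an
// IOException like a plain IOException, which is what the fuzzing loop
// expects for invalid input.
class FuzzTolerance {
  static void expectOnlyIOExceptions(Callable<?> decode) throws Exception {
    try {
      decode.call();
    } catch (IOException e) {
      // expected for fuzzed input
    } catch (RuntimeException e) {
      if (!(e.getCause() instanceof IOException)) {
        throw e;   // anything else is a real test failure
      }
      // wrapped IOException: also expected
    }
  }
}
{code}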

Example failure:
{code}
Failed

org.apache.hadoop.hdfs.server.namenode.TestEditLog.testFuzzSequences
Failing for the past 5 builds (Since Failed#3600 )
Took 4.8 sec.
Error Message

Caught non-IOException throwable java.lang.RuntimeException: java.io.IOException: Invalid UTF8 at 9871b370d70a
  at org.apache.hadoop.io.UTF8.toString(UTF8.java:154)
  at org.apache.hadoop.hdfs.server.namenode.FSImageSerialization.readString(FSImageSerialization.java:200)
  at org.apache.hadoop.hdfs.server.namenode.FSEditLogOp$TimesOp.readFields(FSEditLogOp.java:1439)
  at org.apache.hadoop.hdfs.server.namenode.FSEditLogOp$Reader.decodeOp(FSEditLogOp.java:2399)
  at org.apache.hadoop.hdfs.server.namenode.FSEditLogOp$Reader.readOp(FSEditLogOp.java:2290)
  at org.apache.hadoop.hdfs.server.namenode.EditLogFileInputStream.nextOpImpl(EditLogFileInputStream.java:177)
  at org.apache.hadoop.hdfs.server.namenode.EditLogFileInputStream.nextOpImpl(EditLogFileInputStream.java:175)
  at org.apache.hadoop.hdfs.server.namenode.EditLogFileInputStream.nextOp(EditLogFileInputStream.java:217)
  at org.apache.hadoop.hdfs.server.namenode.EditLogInputStream.readOp(EditLogInputStream.java:72)
  at org.apache.hadoop.hdfs.server.namenode.TestEditLog.validateNoCrash(TestEditLog.java:1233)
  at org.apache.hadoop.hdfs.server.namenode.TestEditLog.testFuzzSequences(TestEditLog.java:1272)
  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
  at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
  at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
  at java.lang.reflect.Method.invoke(Method.java:597)
  at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:44)
  at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:15)
  at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:41)
  at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:20)
  at org.junit.runners.BlockJUnit4ClassRunner.runNotIgnored(BlockJUnit4ClassRunner.java:79)
  at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:71)
  at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:49)
  at org.junit.runners.ParentRunner$3.run(ParentRunner.java:193)
  at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:52)
  at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:191)
  at org.junit.runners.ParentRunner.access$000(ParentRunner.java:42)
  at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:184)
  at org.junit.runners.ParentRunner.run(ParentRunner.java:236)
  at org.apache.maven.surefire.junit4.JUnit4Provider.execute(JUnit4Provider.java:252)
  at org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:141)
  at org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:112)
  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
  at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
  at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
  at java.lang.reflect.Method.invoke(Method.java:597)
  at org.apache.maven.surefire.util.ReflectionUtils.invokeMethodWithArray(ReflectionUtils.java:189)
  at org.apache.maven.surefire.booter.ProviderFactory$ProviderProxy.invoke(ProviderFactory.java:165)
  at org.apache.maven.surefire.booter.ProviderFactory.invokeProvider(ProviderFactory.java:85)
  at org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:115)
  at org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:75)
Caused by: java.io.IOException: Invalid UTF8 at 9871b370d70a
  at org.apache.hadoop.io.UTF8.readChars(UTF8.java:277)
  at org.apache.hadoop.io.UTF8.toString(UTF8.java:151)
  ... 39 more
{code}

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira