Hadoop-Hdfs-trunk - Build # 609 - Still Failing

2011-03-17 Thread Apache Hudson Server
See https://hudson.apache.org/hudson/job/Hadoop-Hdfs-trunk/609/

###
## LAST 60 LINES OF THE CONSOLE 
###
[...truncated 704012 lines...]
[mkdir] Created dir: 
/grid/0/hudson/hudson-slave/workspace/Hadoop-Hdfs-trunk/trunk/build/contrib/hdfsproxy/target
 [echo]  Including clover.jar in the war file ...
[cactifywar] Analyzing war: 
/grid/0/hudson/hudson-slave/workspace/Hadoop-Hdfs-trunk/trunk/build/contrib/hdfsproxy/hdfsproxy-2.0-test.war
[cactifywar] Building war: 
/grid/0/hudson/hudson-slave/workspace/Hadoop-Hdfs-trunk/trunk/build/contrib/hdfsproxy/target/test.war

cactifywar:

test-cactus:
 [echo]  Free Ports: startup-11592 / http-11593 / https-11594
 [echo] Please take a deep breath while Cargo gets the Tomcat for running 
the servlet tests...
[mkdir] Created dir: 
/grid/0/hudson/hudson-slave/workspace/Hadoop-Hdfs-trunk/trunk/build/contrib/hdfsproxy/target/tomcat-config
[mkdir] Created dir: 
/grid/0/hudson/hudson-slave/workspace/Hadoop-Hdfs-trunk/trunk/build/contrib/hdfsproxy/target/tomcat-config/conf
[mkdir] Created dir: 
/grid/0/hudson/hudson-slave/workspace/Hadoop-Hdfs-trunk/trunk/build/contrib/hdfsproxy/target/tomcat-config/webapps
[mkdir] Created dir: 
/grid/0/hudson/hudson-slave/workspace/Hadoop-Hdfs-trunk/trunk/build/contrib/hdfsproxy/target/tomcat-config/temp
[mkdir] Created dir: 
/grid/0/hudson/hudson-slave/workspace/Hadoop-Hdfs-trunk/trunk/build/contrib/hdfsproxy/target/logs
[mkdir] Created dir: 
/grid/0/hudson/hudson-slave/workspace/Hadoop-Hdfs-trunk/trunk/build/contrib/hdfsproxy/target/reports
 [copy] Copying 1 file to 
/grid/0/hudson/hudson-slave/workspace/Hadoop-Hdfs-trunk/trunk/build/contrib/hdfsproxy/target/tomcat-config/conf
 [copy] Copying 1 file to 
/grid/0/hudson/hudson-slave/workspace/Hadoop-Hdfs-trunk/trunk/build/contrib/hdfsproxy/target/tomcat-config/conf
 [copy] Copying 1 file to 
/grid/0/hudson/hudson-slave/workspace/Hadoop-Hdfs-trunk/trunk/build/contrib/hdfsproxy/target/tomcat-config/conf
   [cactus] -
   [cactus] Running tests against Tomcat 5.x @ http://localhost:11593
   [cactus] -
   [cactus] Deploying 
[/grid/0/hudson/hudson-slave/workspace/Hadoop-Hdfs-trunk/trunk/build/contrib/hdfsproxy/target/test.war]
 to 
[/grid/0/hudson/hudson-slave/workspace/Hadoop-Hdfs-trunk/trunk/build/contrib/hdfsproxy/target/tomcat-config/webapps]...
   [cactus] Tomcat 5.x starting...
Server [Apache-Coyote/1.1] started
   [cactus] WARNING: multiple versions of ant detected in path for junit 
   [cactus]  
jar:file:/homes/hudson/tools/ant/latest/lib/ant.jar!/org/apache/tools/ant/Project.class
   [cactus]  and 
jar:file:/homes/hudson/.ivy2/cache/ant/ant/jars/ant-1.6.5.jar!/org/apache/tools/ant/Project.class
   [cactus] Running org.apache.hadoop.hdfsproxy.TestAuthorizationFilter
   [cactus] Tests run: 4, Failures: 2, Errors: 0, Time elapsed: 0.486 sec
   [cactus] Test org.apache.hadoop.hdfsproxy.TestAuthorizationFilter FAILED
   [cactus] Running org.apache.hadoop.hdfsproxy.TestLdapIpDirFilter
   [cactus] Tests run: 2, Failures: 0, Errors: 0, Time elapsed: 0.555 sec
   [cactus] Tomcat 5.x started on port [11593]
   [cactus] Running org.apache.hadoop.hdfsproxy.TestProxyFilter
   [cactus] Tests run: 2, Failures: 0, Errors: 0, Time elapsed: 0.311 sec
   [cactus] Running org.apache.hadoop.hdfsproxy.TestProxyForwardServlet
   [cactus] Tests run: 2, Failures: 0, Errors: 0, Time elapsed: 0.343 sec
   [cactus] Running org.apache.hadoop.hdfsproxy.TestProxyUtil
   [cactus] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 0.875 sec
   [cactus] Tomcat 5.x is stopping...
   [cactus] Tomcat 5.x is stopped

BUILD FAILED
/grid/0/hudson/hudson-slave/workspace/Hadoop-Hdfs-trunk/trunk/build.xml:753: 
The following error occurred while executing this line:
/grid/0/hudson/hudson-slave/workspace/Hadoop-Hdfs-trunk/trunk/build.xml:734: 
The following error occurred while executing this line:
/grid/0/hudson/hudson-slave/workspace/Hadoop-Hdfs-trunk/trunk/src/contrib/build.xml:49:
 The following error occurred while executing this line:
/grid/0/hudson/hudson-slave/workspace/Hadoop-Hdfs-trunk/trunk/src/contrib/hdfsproxy/build.xml:343:
 Tests failed!

Total time: 51 minutes 16 seconds
[FINDBUGS] Skipping publisher since build result is FAILURE
Publishing Javadoc
Archiving artifacts
Recording test results
Recording fingerprints
Publishing Clover coverage report...
No Clover report will be published due to a Build Failure
Email was triggered for: Failure
Sending email for trigger: Failure



###
## FAILED TESTS (if any) 
##
2 tests failed.
FAILED:  org.apache.hadoop.hdfsproxy.TestAu

[jira] Created: (HDFS-1764) add 'Time Since Declared Dead' to namenode dead data nodes web page

2011-03-17 Thread Hairong Kuang (JIRA)
add 'Time Since Declared Dead' to namenode dead data nodes web page
--

 Key: HDFS-1764
 URL: https://issues.apache.org/jira/browse/HDFS-1764
 Project: Hadoop HDFS
  Issue Type: Improvement
Reporter: Hairong Kuang
Assignee: Hairong Kuang


I am filing this jira for Andrew. :)

Currently the dead nodes page of a namenode lists only the dead datanodes' 
hostnames. In addition, I would like to list the duration since each node was 
declared dead, for example in the same format the "Decommissioning Nodes" page 
uses for "Time Since Decommissioning Started".

In our Hadoop clusters if a node has only been dead for a few minutes, our 
monitoring is likely to bring the node back without us needing to do anything 
about it. This proposed functionality will help administrators identify which 
nodes need manual attention and which nodes are likely to be fixed by our 
monitoring. If the node has been dead for many hours, it merits a closer look.

This seems like useful functionality for open source as well.
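The display being requested is a small piece of arithmetic over the namenode's record of when a node was last heard from. A minimal sketch, with a hypothetical helper name (the real page would read the datanode's last-contact timestamp):

```java
public class DeadNodeAge {

    /** Format the elapsed time since a node was declared dead, in the same
     *  "N hrs M mins" style the decommissioning page uses.
     *  (Hypothetical helper; field and method names are illustrative.) */
    static String timeSinceDeclaredDead(long lastContactMs, long nowMs) {
        long secs = (nowMs - lastContactMs) / 1000;
        return String.format("%d hrs %d mins", secs / 3600, (secs % 3600) / 60);
    }

    public static void main(String[] args) {
        // A node last heard from 90 minutes (5,400,000 ms) ago:
        System.out.println(timeSinceDeclaredDead(0L, 5_400_000L)); // 1 hrs 30 mins
    }
}
```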

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] Created: (HDFS-1765) Block Replication should respect under-replication block priority

2011-03-17 Thread Hairong Kuang (JIRA)
Block Replication should respect under-replication block priority
-

 Key: HDFS-1765
 URL: https://issues.apache.org/jira/browse/HDFS-1765
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: name-node
Affects Versions: 0.23.0
Reporter: Hairong Kuang
Assignee: Hairong Kuang
 Fix For: 0.23.0


Currently, under-replicated blocks are assigned different priorities depending 
on how many replicas a block has. However, the replication monitor works on 
blocks in a round-robin fashion, so newly added high-priority blocks do not get 
replicated until all low-priority blocks are done. For example, on the 
decommissioning datanode web UI we often observe that "blocks with only 
decommissioning replicas" are not scheduled for replication before other 
blocks, risking data availability if the node is shut down for repair before 
decommissioning completes.
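The distinction above can be illustrated with a minimal sketch (hypothetical class and queue names, not the actual NameNode code): draining queues in priority order always serves the most urgent blocks first, whereas a round-robin cursor can sit on low-priority queues while urgent blocks wait.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.SortedMap;
import java.util.TreeMap;

public class ReplicationOrder {

    /** Schedule up to n blocks, always draining the most urgent queue first.
     *  Priority 0 = most urgent (e.g. blocks whose only replicas live on
     *  decommissioning nodes); higher keys = less urgent. */
    static List<String> drainByPriority(SortedMap<Integer, Deque<String>> queues, int n) {
        List<String> scheduled = new ArrayList<>();
        for (Deque<String> q : queues.values()) {  // TreeMap iterates lowest key first
            while (!q.isEmpty() && scheduled.size() < n) {
                scheduled.add(q.poll());
            }
        }
        return scheduled;
    }

    public static void main(String[] args) {
        SortedMap<Integer, Deque<String>> queues = new TreeMap<>();
        queues.put(2, new ArrayDeque<>(List.of("blk_low_1", "blk_low_2")));
        queues.put(0, new ArrayDeque<>(List.of("blk_urgent")));
        // The urgent block is scheduled first, even though it was added last:
        System.out.println(drainByPriority(queues, 2)); // [blk_urgent, blk_low_1]
    }
}
```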

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira


libhdfs not getting compiled

2011-03-17 Thread Aastha Mehta
Hello,

I am working on a project involving hdfs and fuse-dfs API on top of it. I
wanted to trace through the functions called from libhdfs API by fuse-dfs
functions. I added print statements inside the hdfs.c file in appropriate
places to see how the functions progress. I execute ant compile-c++-libhdfs
-Dlibhdfs=1 and then ant compile-contrib -Dlibhdfs=1 -Dfusedfs=1
-Djava5.home=/usr/lib/jvm/java-1.5.0-sun. However, when I use fuse-dfs I
cannot see any of the print statements executed from libhdfs/hdfs.c.

I am using hadoop-0.20.2, and libhdfs is present in
hadoop-0.20.2/src/c++/libhdfs. Could someone tell me whether this is the libhdfs
that gets compiled and used, or whether some other libhdfs is accessed? If this
is the one, then why are the changes made in its files not reflected when
running the code?

Thanks,
Aastha.

-- 
Aastha Mehta
Intern, NetApp, Bangalore
4th year undergraduate, BITS Pilani
E-mail: aasth...@gmail.com


[jira] Created: (HDFS-1766) Datanode is marked dead, but datanode process is alive and verifying blocks

2011-03-17 Thread Hairong Kuang (JIRA)
Datanode is marked dead, but datanode process is alive and verifying blocks
---

 Key: HDFS-1766
 URL: https://issues.apache.org/jira/browse/HDFS-1766
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: data-node
Affects Versions: 0.23.0
Reporter: Hairong Kuang
Assignee: Hairong Kuang
 Fix For: 0.23.0


We have a datanode that is marked dead in the namenode and is not taking any 
traffic. But it is continuously verifying blocks, so the DataNode process is 
definitely not dead. Jstack shows that the main thread and the offerService 
thread are gone, but the JVM is stuck waiting for the other threads to die. It 
seems to me that the offerService thread died abnormally, for example from a 
runtime exception, and did not shut down the other threads before exiting.
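The hang described here is the classic pattern of a worker thread dying from an unchecked exception without triggering shutdown of its siblings. A try/finally around the service loop (sketch below with hypothetical names, not the actual DataNode code) guarantees cleanup runs either way:

```java
public class ServiceLoopGuard {

    static void offerServiceLoop() {
        // Simulate the abnormal death described in the report.
        throw new RuntimeException("simulated offerService failure");
    }

    /** Runs the service loop in a worker thread with a finally-guarded cleanup;
     *  returns true if cleanup ran despite the crash. */
    static boolean runGuarded() {
        final boolean[] cleanupRan = {false};
        Thread t = new Thread(() -> {
            try {
                offerServiceLoop();
            } catch (RuntimeException re) {
                System.err.println("offerService died: " + re.getMessage());
            } finally {
                cleanupRan[0] = true;  // stop sibling threads here so the JVM can exit
            }
        });
        t.start();
        try {
            t.join();
        } catch (InterruptedException ie) {
            Thread.currentThread().interrupt();
        }
        return cleanupRan[0];
    }

    public static void main(String[] args) {
        System.out.println("cleanup ran: " + runGuarded());
    }
}
```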

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira


Re: libhdfs not getting compiled

2011-03-17 Thread Brian Bockelman
Hi Aastha,

Try running "ldd" against the fuse_dfs executable and see where you are pulling 
libhdfs.so from.  It may be that it is linking against the "wrong" one.

Brian

On Mar 17, 2011, at 3:24 PM, Aastha Mehta wrote:

> Hello,
> 
> I am working on a project involving hdfs and fuse-dfs API on top of it. I
> wanted to trace through the functions called from libhdfs API by fuse-dfs
> functions. I added print statements inside the hdfs.c file in appropriate
> places to see how the functions progress. I execute ant compile-c++-libhdfs
> -Dlibhdfs=1 and then ant compile-contrib -Dlibhdfs=1 -Dfusedfs=1
> -Djava5.home=/usr/lib/jvm/java-1.5.0-sun. However, when I use fuse-dfs I
> cannot see any of the print statements executed from libhdfs/hdfs.c.
> 
> I am using hadoop-0.20.2 version and the libhdfs is present in
> hadoop-0.20.2/src/c++/libhdfs. Could someone tell me if this libhdfs is the
> one compiled and used or if there will be some other libhdfs that is
> accessed. If this is the one, then why are the changes made in its files not
> reflected when running the code?
> 
> Thanks,
> Aastha.
> 
> -- 
> Aastha Mehta
> Intern, NetApp, Bangalore
> 4th year undergraduate, BITS Pilani
> E-mail: aasth...@gmail.com





[jira] Created: (HDFS-1767) Delay second Block Reports until after cluster finishes startup, to improve startup times

2011-03-17 Thread Matt Foley (JIRA)
Delay second Block Reports until after cluster finishes startup, to improve 
startup times
-

 Key: HDFS-1767
 URL: https://issues.apache.org/jira/browse/HDFS-1767
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: data-node
Affects Versions: 0.22.0
Reporter: Matt Foley
Assignee: Matt Foley
 Fix For: 0.23.0


Consider a large cluster that takes 40 minutes to start up.  The datanodes 
compete to register and send their Initial Block Reports (IBRs) as fast as they 
can after startup (subject to a small sub-two-minute random delay, which isn't 
relevant to this discussion).  

As each datanode succeeds in sending its IBR, it schedules the starting time 
for its regular cycle of reports, every hour (or other configured value of 
dfs.blockreport.intervalMsec). In order to spread the reports evenly across the 
block report interval, each datanode picks a random fraction of that interval as 
the starting point of its regular report cycle.  For example, if a 
particular datanode ends up randomly selecting 18 minutes after the hour, then 
that datanode will send a Block Report at 18 minutes after the hour every hour 
as long as it remains up.  Other datanodes will start their cycles at other 
randomly selected times.  This code is in DataNode.blockReport() and 
DataNode.scheduleBlockReport().
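The scheduling just described can be sketched as follows (simplified, hypothetical method names; the actual logic lives in DataNode.blockReport() and DataNode.scheduleBlockReport()):

```java
import java.util.Random;

public class BlockReportScheduler {

    /** Delay before the first regular block report: a uniformly random
     *  fraction of the configured interval, so reports from many datanodes
     *  spread evenly instead of arriving in a burst. */
    static long firstReportDelay(long intervalMs, Random rand) {
        return (long) (rand.nextDouble() * intervalMs);
    }

    /** Later reports keep the randomly chosen phase: each fires one full
     *  interval after the previous one (e.g. 18 minutes past every hour). */
    static long nextReportTime(long lastReportTime, long intervalMs) {
        return lastReportTime + intervalMs;
    }

    public static void main(String[] args) {
        long intervalMs = 60L * 60 * 1000;  // dfs.blockreport.intervalMsec, default 1 hour
        long delay = firstReportDelay(intervalMs, new Random());
        System.out.println("first regular report in " + (delay / 60000) + " min");
    }
}
```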

The "second Block Report" (2BR) is the start of these hourly reports.  The 
problem is that some of these 2BRs get scheduled sooner rather than later, and 
actually occur within the startup period.  For example, if the cluster takes 40 
minutes (2/3 of an hour) to start up, then out of the datanodes that succeed in 
sending their IBRs during the first 10 minutes, between 1/2 and 2/3 of them 
will send their 2BR before the 40-minute startup time has completed!
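The 1/2-to-2/3 range follows directly from the uniform scheduling: a datanode whose IBR completes at minute t schedules its 2BR uniformly over the following hour, so the chance it lands inside a 40-minute startup window is (40 - t)/60. A quick check of the endpoints (hypothetical helper, not code from Hadoop):

```java
public class StartupOverlap {

    /** Probability that a 2BR scheduled uniformly over the next hour lands
     *  before startup finishes, for a node whose IBR completed at minute t. */
    static double overlapFraction(double ibrMinute, double startupMinutes) {
        double p = (startupMinutes - ibrMinute) / 60.0;
        return Math.max(0.0, Math.min(1.0, p));  // clamp to a valid probability
    }

    public static void main(String[] args) {
        // Endpoints of the quoted range for a 40-minute startup:
        System.out.println(overlapFraction(0, 40));   // 2/3 for a node whose IBR landed at minute 0
        System.out.println(overlapFraction(10, 40));  // 1/2 at the 10-minute mark
    }
}
```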

2BRs sent within the startup time actually compete with the remaining IBRs, and 
thereby slow down the overall startup process.  This can be seen in the 
following data, which shows the startup process for a 3700-node cluster that 
took about 17 minutes to finish startup:


  min  time        starts  sum   regs  sum   IBR   sum  2nd_BR   sum  total_BRs/min
  0    1299799498  3042    3042  1969  1969  151    151    0       0  151
  1    1299799558   665    3707  1470  3439  248    399    0       0  248
  2    1299799618     0    3707   224  3663  270    669    0       0  270
  3    1299799678     0    3707    14  3677  261    930    3       3  264
  4    1299799738     0    3707    23  3700  288   1218    1       4  289
  5    1299799798     0    3707     7  3707  258   1476    3       7  261
  6    1299799858     0    3707     0  3707  317   1793    4      11  321
  7    1299799918     0    3707     0  3707  292   2085    6      17  298
  8    1299799978     0    3707     0  3707  292   2377    8      25  300
  9    1299800038     0    3707     0  3707  272   2649    0      25  272
  10   1299800098     0    3707     0  3707  280   2929   15      40  295
  11   1299800158     0    3707     0  3707  223   3152   14      54  237
  12   1299800218     0    3707     0  3707  143   3295    0      54  143
  13   1299800278     0    3707     0  3707  141   3436   20      74  161
  14   1299800338     0    3707     0  3707  195   3631   78     152  273
  15   1299800398     0    3707     0  3707   51   3682  209     361  260
  16   1299800458     0    3707     0  3707   25   3707  369     730  394
  17   1299800518     0    3707     0  3707    0   3707  166     896  166
  18   1299800578     0    3707     0  3707    0   3707   72     968   72
  19   1299800638     0    3707     0  3707    0   3707   67    1035   67
  20   1299800698     0    3707     0  3707    0   3707   75    1110   75
  21   1299800758     0    3707     0  3707    0   3707   71    1181   71
  22   1299800818     0    3707     0  3707    0   3707   67    1248   67
  23   1299800878     0    3707     0  3707    0   3707   62    1310   62
  24   1299800938     0    3707     0  3707    0   3707   56    1366   56
  25   1299800998     0    3707     0  3707    0   3707   60    1426   60


This data was harvested from the startup logs of all the datanodes, and 
correlated into one-minute buckets.  Each row of the table represents the 
progress during one elapsed minute of clock time.  It seems that every cluster 
startup is different, but this one showed the effect fairly well.

The "starts" column shows that all the nodes started up within the first 2 
minutes, and the "regs" column shows that all succeeded in registering by 
minute 6.  The IBR column shows a sustained rate of Initial Block Report 
processing of 250-300/minute for the first 10 minutes.

The question is why, during minutes 11 through 16, the rate of IBR processing 
slowed down.  Why didn't the startup just finish?  In the "2nd_BR" column, we 
see the rate of 2BRs ramping up as more datanodes complete their IBRs.  As the 
rate increases, they become more effective at competing with the IBRs, and slow 
down the IBR processing even more.  After the IBRs finally fini