Re: Merging Namenode Federation feature (HDFS-1052) to trunk
> A few questions:
> - Do we have a clear definition for a cluster?

Before federation, a cluster is defined by the list of datanodes in the include file, bound together by the namespaceID of the namenode that these nodes bind to on first registration with the namenode. In essence, the namespaceID defines the cluster nodes. In a federated cluster, the namenodes are set up with the same clusterID. The clusterID is established at the datanodes when they first register with a namenode, so nodes with the same clusterID are part of the cluster.

> - With the above definition, is it an error if not all DNs belong to the
> same set of NNs?

A DN has to belong to the same set of NNs sharing the same clusterID. DNs cannot register with a namenode that has a different clusterID.

> - With the working definition of a cluster, what namespace guarantees are
> given to clients?

I am not sure what you mean by this.
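The registration rule described above (a datanode adopts the clusterID of the first namenode it registers with, and thereafter rejects namenodes with a different clusterID) can be sketched as follows. This is a hypothetical illustration, not the actual DataNode registration code; the class and method names are made up.

```java
import java.io.IOException;

// Hypothetical sketch of the clusterID handshake described above.
// Not the real DataNode code: names and structure are illustrative only.
class DatanodeClusterState {
    private String clusterID;  // null until the first successful registration

    void register(String namenodeClusterID) throws IOException {
        if (clusterID == null) {
            // First registration establishes cluster membership.
            clusterID = namenodeClusterID;
        } else if (!clusterID.equals(namenodeClusterID)) {
            // A DN may only serve namenodes of the cluster it belongs to.
            throw new IOException("Incompatible clusterIDs: datanode belongs to "
                + clusterID + ", namenode reports " + namenodeClusterID);
        }
        // Same clusterID: registration proceeds normally.
    }
}
```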
Re: Merging Namenode Federation feature (HDFS-1052) to trunk
> But this does make things easier. Although I'm still fairly
> confident that it adds too much complexity for little gain though.

Allen, can you please add details on what complexity you are talking about here? (I have already asked this question many times.) From a code perspective it is not adding complexity, as I have explained before. You could choose to run the cluster with a single namenode and not see much difference. But federation does solve, in our case, the complicated setup of multiple clusters, the balancing of storage across the clusters, the lack of a single view, and the duplication of data.

> So put this in the 'agree to disagree' column. It would still be nice if
> you guys could lay off the camelCase options though. Admins hate the shift
> key.

I did reply to your comment saying the options are case insensitive.

> BTW, Robert C. asked what I thought you guys should have been
> working on instead of Federation. I told him (and you) high availability of
> the namenode (which I still believe is necessary for HDFS in more and more
> cases), but I've had more time to think about it. So expect my list (which
> I'll post here) soon. :p

Federation is solving an important problem for us. We are looking at HA, as you might have seen in some of the jira activities.
Hadoop-Hdfs-trunk - Build # 616 - Still Failing
See https://hudson.apache.org/hudson/job/Hadoop-Hdfs-trunk/616/

### LAST 60 LINES OF THE CONSOLE ###

[...truncated 708649 lines...]
    [junit] 2011-03-24 12:22:35,345 INFO  datanode.DataNode (BlockReceiver.java:run(914)) - PacketResponder blk_6538719823285349735_1001 0 : Thread is interrupted.
    [junit] 2011-03-24 12:22:35,344 INFO  datanode.DataNode (DataNode.java:shutdown(788)) - Waiting for threadgroup to exit, active threads is 3
    [junit] 2011-03-24 12:22:35,344 INFO  ipc.Server (Server.java:run(487)) - Stopping IPC Server listener on 34307
    [junit] 2011-03-24 12:22:35,345 INFO  datanode.DataNode (BlockReceiver.java:run(999)) - PacketResponder 0 for block blk_6538719823285349735_1001 terminating
    [junit] 2011-03-24 12:22:35,345 INFO  datanode.BlockReceiverAspects (BlockReceiverAspects.aj:ajc$after$org_apache_hadoop_hdfs_server_datanode_BlockReceiverAspects$9$725950a6(220)) - FI: blockFileClose, datanode=DatanodeRegistration(127.0.0.1:42931, storageID=DS-1663091717-127.0.1.1-42931-1300969344471, infoPort=45787, ipcPort=34307)
    [junit] 2011-03-24 12:22:35,346 ERROR datanode.DataNode (DataXceiver.java:run(132)) - DatanodeRegistration(127.0.0.1:42931, storageID=DS-1663091717-127.0.1.1-42931-1300969344471, infoPort=45787, ipcPort=34307):DataXceiver
    [junit] java.lang.RuntimeException: java.lang.InterruptedException: sleep interrupted
    [junit] 	at org.apache.hadoop.fi.FiTestUtil.sleep(FiTestUtil.java:82)
    [junit] 	at org.apache.hadoop.fi.DataTransferTestUtil$SleepAction.run(DataTransferTestUtil.java:346)
    [junit] 	at org.apache.hadoop.fi.DataTransferTestUtil$SleepAction.run(DataTransferTestUtil.java:1)
    [junit] 	at org.apache.hadoop.fi.FiTestUtil$ActionContainer.run(FiTestUtil.java:116)
    [junit] 	at org.apache.hadoop.hdfs.server.datanode.BlockReceiverAspects.ajc$before$org_apache_hadoop_hdfs_server_datanode_BlockReceiverAspects$7$b9c2bffe(BlockReceiverAspects.aj:193)
    [junit] 	at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:451)
    [junit] 	at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:639)
    [junit] 	at org.apache.hadoop.hdfs.server.datanode.DataXceiver.opWriteBlock(DataXceiver.java:390)
    [junit] 	at org.apache.hadoop.hdfs.protocol.DataTransferProtocol$Receiver.opWriteBlock(DataTransferProtocol.java:390)
    [junit] 	at org.apache.hadoop.hdfs.protocol.DataTransferProtocol$Receiver.processOp(DataTransferProtocol.java:332)
    [junit] 	at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:130)
    [junit] 	at java.lang.Thread.run(Thread.java:662)
    [junit] Caused by: java.lang.InterruptedException: sleep interrupted
    [junit] 	at java.lang.Thread.sleep(Native Method)
    [junit] 	at org.apache.hadoop.fi.FiTestUtil.sleep(FiTestUtil.java:80)
    [junit] 	... 11 more
    [junit] 2011-03-24 12:22:35,347 INFO  datanode.DataNode (DataNode.java:shutdown(788)) - Waiting for threadgroup to exit, active threads is 0
    [junit] 2011-03-24 12:22:35,448 INFO  datanode.DataBlockScanner (DataBlockScanner.java:run(624)) - Exiting DataBlockScanner thread.
    [junit] 2011-03-24 12:22:35,448 INFO  datanode.DataNode (DataNode.java:run(1464)) - DatanodeRegistration(127.0.0.1:42931, storageID=DS-1663091717-127.0.1.1-42931-1300969344471, infoPort=45787, ipcPort=34307):Finishing DataNode in: FSDataset{dirpath='/grid/0/hudson/hudson-slave/workspace/Hadoop-Hdfs-trunk/trunk/build-fi/test/data/dfs/data/data1/current/finalized,/grid/0/hudson/hudson-slave/workspace/Hadoop-Hdfs-trunk/trunk/build-fi/test/data/dfs/data/data2/current/finalized'}
    [junit] 2011-03-24 12:22:35,448 INFO  ipc.Server (Server.java:stop(1626)) - Stopping server on 34307
    [junit] 2011-03-24 12:22:35,448 INFO  datanode.DataNode (DataNode.java:shutdown(788)) - Waiting for threadgroup to exit, active threads is 0
    [junit] 2011-03-24 12:22:35,449 INFO  datanode.FSDatasetAsyncDiskService (FSDatasetAsyncDiskService.java:shutdown(133)) - Shutting down all async disk service threads...
    [junit] 2011-03-24 12:22:35,449 INFO  datanode.FSDatasetAsyncDiskService (FSDatasetAsyncDiskService.java:shutdown(142)) - All async disk service threads have been shut down.
    [junit] 2011-03-24 12:22:35,449 WARN  datanode.FSDatasetAsyncDiskService (FSDatasetAsyncDiskService.java:shutdown(130)) - AsyncDiskService has already shut down.
    [junit] 2011-03-24 12:22:35,551 WARN  namenode.FSNamesystem (FSNamesystem.java:run(2856)) - ReplicationMonitor thread received InterruptedException.java.lang.InterruptedException: sleep interrupted
    [junit] 2011-03-24 12:22:35,551 INFO  namenode.FSEditLog (FSEditLog.java:printStatistics(559)) - Number of transactions: 6 Total time for transactions(ms): 0Number of transactio
[HDFS-1120] An interface suggestion for volume-choice for storing a block
Hello,

I've done some initial work for making the volume-block choosing policy pluggable (so that methods other than round-robin may be provided). The following is my initial interface design, and I am looking for comments/critique/etc. w.r.t. https://issues.apache.org/jira/browse/HDFS-1120's scope, before I start pushing out some tests + patches:

/**
 * BlockVolumeChoosingPolicy allows a DataNode to
 * specify what policy is to be used while choosing
 * a volume for a block request.
 */
public interface BlockVolumeChoosingPolicy extends Configurable {

  /**
   * Returns a specific FSVolume after applying a suitable choice algorithm
   * to place a given block, given a list of FSVolumes and the block
   * size sought for storage.
   * @param volumes - the array of FSVolumes that are available.
   * @param blockSize - the size of the block for which a volume is sought.
   * @return the chosen volume to store the block.
   * @throws IOException when disks are unavailable or are full.
   */
  public FSVolume chooseVolume(FSVolume[] volumes, long blockSize)
      throws IOException;
}

This can be neatly used within FSVolumeSet.getNextVolume(). [Maybe this too needs to be renamed, since it may not make sense as 'next' once it becomes pluggable.]

Looking forward to a discussion.

--
Harsh J
http://harshj.com
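To make the proposed contract concrete, here is a minimal sketch of what a round-robin implementation of chooseVolume() could look like. This is not the actual DataNode code: a stand-in Volume interface is used in place of Hadoop's FSVolume (and Configurable is omitted) so the sketch is self-contained.

```java
import java.io.IOException;
import java.util.List;

// Stand-in for Hadoop's FSVolume; only the capacity query this sketch needs.
interface Volume {
    long getAvailable() throws IOException;
}

// A minimal round-robin implementation of the proposed chooseVolume() contract.
// Hypothetical names; the real interface would use FSVolume and extend Configurable.
class RoundRobinVolumePolicy {
    private int next = 0;  // rotation cursor, persists across calls

    public Volume chooseVolume(List<Volume> volumes, long blockSize)
            throws IOException {
        if (volumes.isEmpty()) {
            throw new IOException("no volumes available");
        }
        int start = next;
        while (true) {
            Volume v = volumes.get(next);
            next = (next + 1) % volumes.size();
            if (v.getAvailable() >= blockSize) {
                return v;          // first volume in rotation with room wins
            }
            if (next == start) {   // wrapped all the way around: every volume is full
                throw new IOException(
                    "out of space: no volume can fit " + blockSize + " bytes");
            }
        }
    }
}
```

A custom policy (e.g. most-available-space first) would implement the same method and be selected by configuration, which is exactly the pluggability the interface is meant to enable.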
[jira] [Resolved] (HDFS-1773) Remove a datanode from cluster if include list is not empty and this datanode is removed from both include and exclude lists
[ https://issues.apache.org/jira/browse/HDFS-1773?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tsz Wo (Nicholas), SZE resolved HDFS-1773.
------------------------------------------
      Resolution: Fixed
   Fix Version/s:     (was: 0.20.4)
                  0.20.204
    Hadoop Flags: [Reviewed]

I have committed this to 0.20-security. Thanks, Tanping!

> Remove a datanode from cluster if include list is not empty and this datanode
> is removed from both include and exclude lists
>
>                 Key: HDFS-1773
>                 URL: https://issues.apache.org/jira/browse/HDFS-1773
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>    Affects Versions: 0.20.203.1
>         Environment: branch-20-security
>            Reporter: Tanping Wang
>            Assignee: Tanping Wang
>            Priority: Minor
>             Fix For: 0.20.204
>
>         Attachments: HDFS-1773-2.patch, HDFS-1773-3.patch, HDFS-1773.patch
>
> Our service engineering team, who operate the clusters on a daily basis,
> find it confusing that after a datanode is decommissioned, there is no
> way to make the cluster forget about this datanode, and it always remains
> in the dead node list.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Created] (HDFS-1781) jsvc executable delivered into wrong package...
jsvc executable delivered into wrong package...
-----------------------------------------------

                 Key: HDFS-1781
                 URL: https://issues.apache.org/jira/browse/HDFS-1781
             Project: Hadoop HDFS
          Issue Type: Bug
    Affects Versions: 0.22.0
            Reporter: John George
            Assignee: John George

The jsvc executable is delivered in the 0.22 hdfs package, but the script that uses it (bin/hdfs) refers to $HADOOP_HOME/bin/jsvc to find it.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Created] (HDFS-1782) FSNamesystem.startFileInternal(..) throws NullPointerException
FSNamesystem.startFileInternal(..) throws NullPointerException
--------------------------------------------------------------

                 Key: HDFS-1782
                 URL: https://issues.apache.org/jira/browse/HDFS-1782
             Project: Hadoop HDFS
          Issue Type: Bug
    Affects Versions: 0.22.0
            Reporter: John George
            Assignee: John George
             Fix For: 0.22.0

I'm observing that when one balancer is running, trying to run another one results in a "java.lang.NullPointerException". I was hoping to see the message "Another balancer is running. Exiting ...". This is a reproducible issue.

Details:

1) Cluster -> elrond

[hdfs@gsbl90568 smilli]$ hadoop version
Hadoop 0.22.0.1102280202
Subversion git://hadoopre5.corp.sk1.yahoo.com/home/y/var/builds/thread2/workspace/Cloud-HadoopCOMMON-0.22-Secondary -r c7c9a21d7289e29f0133452acf8b761e455a84b5
Compiled by hadoopqa on Mon Feb 28 02:12:38 PST 2011
From source with checksum 9ecbc6f17e8847a1cddca2282dbd9b31
[hdfs@gsbl90568 smilli]$

2) Run the first balancer

[hdfs@gsbl90565 smilli]$ hdfs balancer
11/03/09 16:33:56 INFO balancer.Balancer: namenodes = [gsbl90565.blue.ygrid.yahoo.com/98.137.97.57:8020, gsbl90569.blue.ygrid.yahoo.com/98.137.97.53:8020]
11/03/09 16:33:56 INFO balancer.Balancer: p = Balancer.Parameters[BalancingPolicy.Node, threshold=10.0]
Time Stamp  Iteration#  Bytes Already Moved  Bytes Left To Move  Bytes Being Moved
11/03/09 16:33:57 WARN conf.Configuration: mapred.task.id is deprecated. Instead, use mapreduce.task.attempt.id
11/03/09 16:33:57 INFO balancer.Balancer: Block token params received from NN: keyUpdateInterval=600 min(s), tokenLifetime=600 min(s)
11/03/09 16:33:57 INFO block.BlockTokenSecretManager: Setting block keys
11/03/09 16:33:57 INFO balancer.Balancer: Balancer will update its block keys every 150 minute(s)
11/03/09 16:33:57 INFO block.BlockTokenSecretManager: Setting block keys
11/03/09 16:33:57 INFO balancer.Balancer: Block token params received from NN: keyUpdateInterval=600 min(s), tokenLifetime=600 min(s)
11/03/09 16:33:57 INFO block.BlockTokenSecretManager: Setting block keys
11/03/09 16:33:57 INFO balancer.Balancer: Balancer will update its block keys every 150 minute(s)
11/03/09 16:33:57 INFO block.BlockTokenSecretManager: Setting block keys
11/03/09 16:33:57 INFO net.NetworkTopology: Adding a new node: /98.137.97.0/98.137.97.62:1004
11/03/09 16:33:57 INFO net.NetworkTopology: Adding a new node: /98.137.97.0/98.137.97.58:1004
11/03/09 16:33:57 INFO net.NetworkTopology: Adding a new node: /98.137.97.0/98.137.97.60:1004
11/03/09 16:33:57 INFO net.NetworkTopology: Adding a new node: /98.137.97.0/98.137.97.59:1004
11/03/09 16:33:57 INFO balancer.Balancer: 1 over-utilized: [Source[98.137.97.62:1004, utilization=24.152507825759344]]
11/03/09 16:33:57 INFO balancer.Balancer: 0 underutilized: []
11/03/09 16:33:57 INFO balancer.Balancer: Need to move 207.98 GB to make the cluster balanced.
11/03/09 16:33:57 INFO balancer.Balancer: Decided to move 10 GB bytes from 98.137.97.62:1004 to 98.137.97.58:1004
11/03/09 16:33:57 INFO balancer.Balancer: Will move 10 GB in this iteration
Mar 9, 2011 4:33:57 PM  0  0 KB  207.98 GB  10 GB
.
.
.
11/03/09 16:34:36 INFO balancer.Balancer: Moving block -63570336576981940 from 98.137.97.62:1004 to 98.137.97.59:1004 through 98.137.97.62:1004 is succeeded.
11/03/09 16:34:39 INFO balancer.Balancer: Moving block 2379736326585824737 from 98.137.97.62:1004 to 98.137.97.59:1004 through 98.137.97.62:1004 is succeeded.
11/03/09 16:35:21 INFO balancer.Balancer: Moving block 8884583953927078028 from 98.137.97.62:1004 to 98.137.97.59:1004 through 98.137.97.62:1004 is succeeded.
11/03/09 16:35:24 INFO balancer.Balancer: Moving block -135758138424743964 from 98.137.97.62:1004 to 98.137.97.59:1004 through 98.137.97.62:1004 is succeeded.
11/03/09 16:35:27 INFO balancer.Balancer: Moving block -4598153351946352185 from 98.137.97.62:1004 to 98.137.97.59:1004 through 98.137.97.62:1004 is succeeded.
11/03/09 16:35:33 INFO balancer.Balancer: Moving block 2966087210491094643 from 98.137.97.62:1004 to 98.137.97.59:1004 through 98.137.97.62:1004 is succeeded.
11/03/09 16:35:42 INFO balancer.Balancer: Moving block -5573983508500804184 from 98.137.97.62:1004 to 98.137.97.59:1004 through 98.137.97.62:1004 is succeeded.
11/03/09 16:35:58 INFO balancer.Balancer: Moving block -6222779741597113957 from 98.137.97.62:1004 to 98.137.97.59:1004 through 98.137.97.62:1004 is succeeded.

3) Run another balancer and observe

[hdfs@gsbl90568 smilli]$ hdfs balancer
11/03/09 16:34:32 INFO balancer.Balancer: namenodes = [gsbl90565.blue.ygrid.yahoo.com/98.137.97.57:8020, gsbl90569.blue.ygrid.yahoo.com/98.137.97.53:8020]
11/03/09 16:34:32 INFO balancer.Balancer: p = Balancer.Parameters[BalancingPolicy.Node, threshold=10.0]
Time Stamp  Iteration#  Bytes Alre
[jira] [Created] (HDFS-1783) Ability for HDFS client to write replicas in parallel
Ability for HDFS client to write replicas in parallel
-----------------------------------------------------

                 Key: HDFS-1783
                 URL: https://issues.apache.org/jira/browse/HDFS-1783
             Project: Hadoop HDFS
          Issue Type: Improvement
          Components: hdfs client
            Reporter: dhruba borthakur
            Assignee: dhruba borthakur

The current implementation of HDFS pipelines the writes to the three replicas. This introduces some latency for realtime latency-sensitive applications. An alternate implementation that allows the client to write all replicas in parallel gives much better response times to these applications.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Created] (HDFS-1785) Cleanup BlockReceiver and DataXceiver
Cleanup BlockReceiver and DataXceiver
-------------------------------------

                 Key: HDFS-1785
                 URL: https://issues.apache.org/jira/browse/HDFS-1785
             Project: Hadoop HDFS
          Issue Type: Improvement
          Components: data-node
            Reporter: Tsz Wo (Nicholas), SZE

{{clientName.length()}} is used multiple times for determining whether the source is a client or a datanode.
{code}
if (clientName.length() == 0) {
  //it is a datanode
}
{code}

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
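One possible shape for the cleanup the issue suggests is computing the source kind once rather than re-testing clientName.length() at every use site. This is only an illustrative sketch; the SourceKind class and field names are hypothetical, not from the HDFS codebase.

```java
// Illustrative sketch of the suggested cleanup (hypothetical names, not HDFS code):
// derive the source kind from clientName once, then branch on readable booleans.
class SourceKind {
    final boolean isDatanode;  // an empty client name means the source is a datanode
    final boolean isClient;

    SourceKind(String clientName) {
        this.isDatanode = clientName.isEmpty();
        this.isClient = !this.isDatanode;
    }
}
```

Call sites could then test `kind.isDatanode` instead of repeating the length check, making each branch self-documenting.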