On Apr 25, 2011, at 2:36 PM, Doug Cutting wrote:
A couple of questions:
1. Can you please describe the significant advantages this approach has over a symlink-based approach?
It seems to me that one could run multiple namenodes on separate boxes and run multiple datanode processes per storage box configured with something like:
.......
Doug,
There are two separate issues; your email seems to suggest that these
are joined.
(1) creating (or not) a unified namespace
(2) sharing the storage and the block storage layer across NameNodes -
the architecture document covers this layering in great detail.
This separation reflects the architecture of HDFS (derived from GFS), where the namespace layer is separate from the block storage layer (although the HDFS implementation violates the layering in many places).
HDFS-1052 deals with (2) - allowing multiple NameNodes to share the
block storage layer.
As for (1), creating a unified namespace: federation does NOT dictate how you create a unified namespace, or whether you even create one in the first place. Indeed, you may want to share the physical storage but keep independent namespaces. For example, you may want to run a private namespace for HBase files within the same Hadoop cluster. Two different tenants sharing a cluster may choose to keep independent namespaces for isolation.
Of course, in many situations one wants to create a unified namespace. One could create a unified namespace using symbolic links, as you suggest. The federation work has also added client-side mount tables (HDFS-1053), an implementation of FileSystem and AbstractFileSystem. It offers advantages over symbolic links, but this is separable and you can use symbolic links if you like. HDFS-1053 (client-side mount tables) makes no changes to any existing file system.
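To make the mount-table idea concrete, here is a minimal sketch, in Java, of a client stitching two federated namespaces into one view. The property names, host names, and paths are illustrative assumptions about a viewfs-style configuration, not the exact keys defined by HDFS-1053:

    // Hypothetical sketch: a client-side mount table over two namenodes.
    // Property names, hosts, and paths are illustrative assumptions, not
    // the exact configuration keys defined by HDFS-1053.
    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class MountTableSketch {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Make the mounted view the client's default file system.
        conf.set("fs.defaultFS", "viewfs:///");
        // Map client-visible subtrees onto the namenode that owns each one.
        conf.set("fs.viewfs.mounttable.default.link./user",
                 "hdfs://nn1.example.com:8020/user");
        conf.set("fs.viewfs.mounttable.default.link./data",
                 "hdfs://nn2.example.com:8020/data");

        FileSystem fs = FileSystem.get(URI.create("viewfs:///"), conf);
        // The client sees a single namespace; /user and /data resolve to
        // different namenodes behind the scenes.
        System.out.println(fs.exists(new Path("/user/joe")));
      }
    }

The design point is that the mount table lives entirely on the client, which is why it requires no changes to any existing file system.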
Now getting to (2), sharing the physical storage and the block
storage layer.
The approach you describe (running multiple DNs on the same machine, which is essentially multiple super-imposed HDFS clusters) is the most common reaction to this work, and one which we also explored. Unfortunately, this approach runs into several issues, and when you start exploring the details you realize that it is essentially a hack:
- Extra DN processes running on the same machine take precious memory away from MR tasks.
- Independent pools of threads for each DN
- Not being able to schedule disk operations across multiple DNs
- Not being able to provide a unified view of balancing or decommissioning. For example, one could run multiple balancers, but this gives you less control over the bandwidth used for balancing.
- The disk-fail-in-place work and the work to rebalance disks when a new disk is introduced would become more complicated to coordinate across DNs.
- Federation allows the cluster to be managed as a unit rather than as a bunch of overlapping HDFS clusters. Overlapping HDFS clusters would be operationally taxing.
On the other hand, the new architecture generalizes the block storage layer and allows us to evolve it to address new needs. For example, it will allow us to address issues like offering tmp storage for intermediate MR output - one can allocate a block pool for MR tmp storage on each DN. HBase could also use the block storage layer directly without going through a NameNode.
2. .... The patch modifies much
of the logic of Hadoop's central component, upon which the performance
and reliability of most other components of the ecosystem depend.
Changes to the code base
- The fundamental code change is to extend the notion of a block id to include a block pool id (a hypothetical sketch of this shape appears below).
- The NN had little change; the protocols did change to include the block pool id.
- The DN code did change: each data structure is now indexed by the block pool id. While this is a code change, it is architecturally very simple and low risk.
- We also did a fair amount of cleanup of the threads used to send block reports. While the cleanup was not strictly necessary, we took the extra effort to pay down the technical debt. As Dhruba recently noted, adding support to send block reports to primary and secondary NNs for HA will now be much easier to do.
The write and read pipelines, which are performance critical, have NOT changed.
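To make the shape of the change concrete, here is a minimal sketch, not the actual HDFS classes, of a block identifier extended with a block pool id and of a DN-side structure indexed per block pool. All names here are hypothetical:

    // Hypothetical sketch only, not the actual HDFS code: a block id
    // extended with a block pool id, and per-pool state on the DN.
    import java.util.HashMap;
    import java.util.Map;

    final class ExtendedBlockId {
      final String blockPoolId; // identifies the namespace (NN) owning the block
      final long blockId;       // unique within its block pool

      ExtendedBlockId(String blockPoolId, long blockId) {
        this.blockPoolId = blockPoolId;
        this.blockId = blockId;
      }
    }

    // A DN serving several block pools keeps its data structures indexed
    // by block pool id first, e.g. one replica map per pool rather than
    // one global map.
    class DataNodeSketch {
      private final Map<String, Map<Long, String>> replicasByPool = new HashMap<>();

      void addReplica(ExtendedBlockId id, String localFile) {
        replicasByPool
            .computeIfAbsent(id.blockPoolId, k -> new HashMap<>())
            .put(id.blockId, localFile);
      }
    }

Because the extension is confined to how blocks are named and indexed, the datanode's read and write paths themselves do not need to change.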
It seems to me that such an invasive change should be well tested
before it
is merged to trunk. Can you please tell me how this has been tested
beyond unit tests?
Risk, Quality & Testing
Besides the amount of code change, one has to ask the fundamental questions: how good is the design, and how well is the project managed?
Conceptually, federation is very simple: pools of blocks are owned by a service (a NN in this case), and the block id is extended by an identifier called the block-pool id.
First and foremost, we wrote a very extensive architecture document - more comprehensive than any other document in Hadoop in the past. This was published very early: version 1 in March 2010 and version 5 in April 2010, based on feedback we received from the community. We sought and incorporated feedback from other HDFS developers outside of Yahoo.
The project was managed as a separate branch rather than introducing the code to trunk incrementally. The branch has also been tested by us as a separate unit - this ensures that it does not destabilize trunk.
More details on testing:
The same QA process that drove and tested key stable Apache Hadoop releases (0.16, 0.17, 0.18, 0.20, 0.20-security) is being used for testing the federation feature. We have been running integrated tests with federation for a few months and continue to do so.
We will not deploy a Hadoop release with the federation feature in Yahoo clusters until we are confident that it is stable and reliable for our clusters. Indeed, the level of testing is significantly greater than for previous releases.
Hopefully the above addresses your concerns.
regards
sanjay