On Apr 25, 2011, at 2:36 PM, Doug Cutting wrote:

A couple of questions:

1. Can you please describe the significant advantages this approach has
over a symlink-based approach?
It seems to me that one could run multiple namenodes on separate boxes
and run multiple datanode processes per storage box configured with
something like:
.......

Doug,

There are two separate issues; your email seems to suggest that they are joined:
(1) creating (or not) a unified namespace
(2) sharing the storage and the block storage layer across NameNodes - the architecture document covers this layering in great detail. This separation reflects the architecture of HDFS (derived from GFS), where the namespace layer is separate from the block storage layer (although the HDFS implementation violates the layers in many places).


HDFS-1052 deals with (2) - allowing multiple NameNodes to share the block storage layer.

As far as (1), creating a unified namespace, federation does NOT dictate how you create a unified namespace, or whether you even create one in the first place. Indeed, you may want to share the physical storage but keep independent namespaces. For example, you may want to run a private namespace for HBase files within the same Hadoop cluster, or two tenants sharing a cluster may choose to have their own independent namespaces for isolation.

Of course, in many situations one does want a unified namespace. One could create it using symbolic links, as you suggest. The federation work has also added client-side mount tables (HDFS-1053), implemented as a FileSystem and an AbstractFileSystem. They offer advantages over symbolic links, but this is separable and you can use symbolic links if you like. HDFS-1053 (client-side mount tables) makes no changes to any existing file system.
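To make the client-side mount table concrete, here is a minimal sketch of how a client might configure one. It assumes the ViewFs-style fs.viewfs.mounttable.* configuration keys from HDFS-1053; the hostnames, paths, and mount-table name are purely illustrative.

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class MountTableExample {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Client-side mount table implementation (may already be the
        // default in your build).
        conf.set("fs.viewfs.impl",
                 "org.apache.hadoop.fs.viewfs.ViewFileSystem");

        // Each mount point maps a path in the unified namespace to a
        // directory under one of the federated NameNodes.
        conf.set("fs.viewfs.mounttable.default.link./user",
                 "hdfs://nn1.example.com:8020/user");
        conf.set("fs.viewfs.mounttable.default.link./data",
                 "hdfs://nn2.example.com:8020/data");

        // Clients see a single namespace rooted at viewfs://default/;
        // the mount table routes each path to the right NameNode.
        FileSystem viewFs = FileSystem.get(new URI("viewfs://default/"), conf);
        System.out.println(viewFs.exists(new Path("/user")));  // resolved via nn1
      }
    }

No existing file system is touched; the unification happens entirely on the client side.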

Now getting to (2), sharing the physical storage and the block storage layer. The approach you describe (running multiple DNs on the same machine, which is essentially multiple super-imposed HDFS clusters) is the most common reaction to this work, and one which we also explored.
Unfortunately this approach runs into several issues, and when you start exploring the details you realize that it is essentially a hack:
- Extra DN processes on the same machine take precious memory away from MR tasks.
- Each DN has its own independent pool of threads.
- Disk operations cannot be scheduled across the multiple DNs.
- There is no unified view of balancing or decommissioning. For example, one could run multiple balancers, but this gives you less control over the bandwidth used for balancing.
- The disk-fail-in-place work and the balance-disks-on-introducing-a-new-disk work would become more complicated to coordinate across DNs.
- Federation allows the cluster to be managed as a unit rather than as a bunch of overlapping HDFS clusters; overlapping HDFS clusters would be operationally taxing.

On the other hand, the new architecture generalizes the block storage layer and allows us to evolve it to address new needs. For example, it will let us offer tmp storage for intermediate MR output - one can allocate a block pool for MR tmp storage on each DN. HBase could also use the block storage layer directly, without going through a NameNode.


2. ....  The patch modifies much
of the logic of Hadoop's central component, upon which the performance
and reliability of most other components of the ecosystem depend.

Changes to the code base
- The fundamental code change is to extend the notion of a block id to include a block pool id (sketched below).
- The NN had little change; the protocols did change to include the block pool id.
- The DN code did change: each data structure is now indexed by the block pool id. While this is a code change, it is architecturally very simple and low risk.
- We also did a fair amount of cleanup of the threads used to send block reports. While the cleanup was not strictly necessary, we took the extra effort to pay down the technical debt. As Dhruba recently noted, adding support to send block reports to both primary and secondary NNs for HA will now be much easier to do.
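As a rough sketch of what this indexing change amounts to (illustrative classes only, not the actual Hadoop code): a block is now identified by (block pool id, block id), and the DN's per-block data structures gain one extra level of indexing keyed by the pool.

    import java.util.HashMap;
    import java.util.Map;

    // Illustrative only: a block is now identified by (blockPoolId, blockId)
    // rather than by blockId alone.
    class ExtendedBlockSketch {
      final String blockPoolId;  // identifies the owning NameNode's block pool
      final long blockId;
      final long generationStamp;

      ExtendedBlockSketch(String blockPoolId, long blockId, long generationStamp) {
        this.blockPoolId = blockPoolId;
        this.blockId = blockId;
        this.generationStamp = generationStamp;
      }
    }

    // Illustrative DN-side structure: each per-block data structure gains
    // one extra level of indexing, keyed by the block pool id.
    class DataNodeBlockMapSketch {
      private final Map<String, Map<Long, ExtendedBlockSketch>> blocksByPool =
          new HashMap<>();

      void add(ExtendedBlockSketch b) {
        blocksByPool
            .computeIfAbsent(b.blockPoolId, k -> new HashMap<>())
            .put(b.blockId, b);
      }

      ExtendedBlockSketch get(String blockPoolId, long blockId) {
        Map<Long, ExtendedBlockSketch> pool = blocksByPool.get(blockPoolId);
        return pool == null ? null : pool.get(blockId);
      }
    }

The real change is spread across the protocols and the DN, but the essence is exactly this extra level of indexing.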

The write and read pipelines, which are performance critical, have NOT changed.

It seems to me that such an invasive change should be well tested before it
is merged to trunk.  Can you please tell me how this has been tested
beyond unit tests?


Risk, Quality & Testing
Besides the amount of code change, one has to ask the fundamental questions: how good is the design, and how is the project managed? Conceptually, federation is very simple: pools of blocks are owned by a service (a NN in this case), and the block id is extended by an identifier called the block pool id.

First and foremost, we wrote a very extensive architecture document - more comprehensive than any other document in Hadoop's past. This was published very early: version 1 in March 2010 and version 5 in April 2010, based on feedback we received from the community. We sought and incorporated feedback from HDFS developers outside of Yahoo.

The project was managed as a separate branch rather than introducing the code into trunk incrementally. We have also tested the branch as a separate unit - this ensures that it does not destabilize trunk.

More details on testing.
The same QA process that drove and tested key stable Apache Hadoop releases (16, 17, 18, 20, 20-security) is being used for testing the federation feature. We have been running integrated tests with federation for a few months and continue to do so. We will not deploy a Hadoop release with the federation feature on Yahoo clusters until we are confident that it is stable and reliable for our clusters. Indeed, the level of testing is significantly greater than for previous releases.

Hopefully the above addresses your concerns.

regards
sanjay
