On Apr 25, 2011, at 2:36 PM, Doug Cutting wrote:
A couple of questions:
1. Can you please describe the significant advantages this approach has over a symlink-based approach?
It seems to me that one could run multiple namenodes on separate boxes and run multiple datanode processes per storage box configured with something like:
.......
Doug,
There are two separate issues; your email seems to suggest that these
are joined.
(1) creating (or not) a unified namespace
(2) sharing the storage and the block storage layer across NameNodes -
the architecture document covers this layering in great detail.
This separation reflects the architecture of HDFS (derived from GFS), where the namespace layer is separate from the block storage layer (although the HDFS implementation violates the layering in many places).
HDFS-1052 deals with (2) - allowing multiple NameNodes to share the
block storage layer.
As for (1), creating a unified namespace: federation does NOT dictate how you create a unified namespace, or whether you even create one in the first place. Indeed, you may want to share the physical storage but keep independent namespaces. For example, you may want to run a private namespace for HBase files within the same Hadoop cluster. Two different tenants sharing a cluster may choose to keep independent namespaces for isolation.
Of course, in many situations one wants to create a unified namespace. One could create a unified namespace using symbolic links, as you suggest. The federation work has also added client-side mount tables (HDFS-1053), an implementation of FileSystem and AbstractFileSystem. It offers advantages over symbolic links, but this is separable and you can use symbolic links if you like. HDFS-1053 (client-side mount tables) makes no changes to any existing file system.
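To make the mount-table idea concrete, here is a minimal sketch, in Java, of a client stitching two federated namespaces into one view. The property names, host names, and paths are illustrative assumptions about a viewfs-style configuration, not the exact keys defined by HDFS-1053:

    // Hypothetical sketch: a client-side mount table over two namenodes.
    // Property names, hosts, and paths are illustrative assumptions, not
    // the exact configuration keys defined by HDFS-1053.
    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class MountTableSketch {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Make the mounted view the client's default file system.
        conf.set("fs.defaultFS", "viewfs:///");
        // Map client-visible subtrees onto the namenode that owns each one.
        conf.set("fs.viewfs.mounttable.default.link./user",
                 "hdfs://nn1.example.com:8020/user");
        conf.set("fs.viewfs.mounttable.default.link./data",
                 "hdfs://nn2.example.com:8020/data");

        FileSystem fs = FileSystem.get(URI.create("viewfs:///"), conf);
        // The client sees a single namespace; /user and /data resolve to
        // different namenodes behind the scenes.
        System.out.println(fs.exists(new Path("/user/joe")));
      }
    }

The design point is that the mount table lives entirely on the client, which is why it requires no changes to any existing file system.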
Now getting to (2), sharing the physical storage and the block
storage layer.
The approach you describe (running multiple DNs on the same machine, which is essentially multiple super-imposed HDFS clusters) is the most common reaction to this work, and one which we also explored. Unfortunately, this approach runs into several issues, and when you start exploring the details you realize that it is essentially a hack:
- Extra DN processes running on the same machine take precious memory away from MR tasks.
- Independent pools of threads for each DN
- Not being able to schedule disk operations across multiple DNs
- Not being able to provide a unified view of balancing or decommissioning. For example, one could run multiple balancers, but this gives you less control over the bandwidth used for balancing.
- The disk-fail-in-place work and the work to rebalance disks when a new disk is introduced would become more complicated to coordinate across DNs.
- Federation allows the cluster to be managed as a unit rather than as a bunch of overlapping HDFS clusters. Overlapping HDFS clusters would be operationally taxing.
On the other hand, the new architecture generalizes the block storage layer and allows us to evolve it to address new needs. For example, it will allow us to address issues like offering tmp storage for intermediate MR output - one can allocate a block pool for MR tmp storage on each DN. HBase could also use the block storage layer directly without going through a NameNode.
2. .... The patch modifies much
of the logic of Hadoop's central component, upon which the performance
and reliability of most other components of the ecosystem depend.
Changes to the code base
- The fundamental code change is to extend the notion of a block id to include a block pool id (a hypothetical sketch of this shape appears below).
- The NN had little change; the protocols did change to include the block pool id.
- The DN code did change: each data structure is now indexed by the block pool id. While this is a code change, it is architecturally very simple and low risk.
- We also did a fair amount of cleanup of the threads used to send block reports. While the cleanup was not strictly necessary, we took the extra effort to pay down the technical debt. As Dhruba recently noted, adding support to send block reports to primary and secondary NNs for HA will now be much easier to do.
The write and read pipelines, which are performance critical, have NOT changed.
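To make the shape of the change concrete, here is a minimal sketch, not the actual HDFS classes, of a block identifier extended with a block pool id and of a DN-side structure indexed per block pool. All names here are hypothetical:

    // Hypothetical sketch only, not the actual HDFS code: a block id
    // extended with a block pool id, and per-pool state on the DN.
    import java.util.HashMap;
    import java.util.Map;

    final class ExtendedBlockId {
      final String blockPoolId; // identifies the namespace (NN) owning the block
      final long blockId;       // unique within its block pool

      ExtendedBlockId(String blockPoolId, long blockId) {
        this.blockPoolId = blockPoolId;
        this.blockId = blockId;
      }
    }

    // A DN serving several block pools keeps its data structures indexed
    // by block pool id first, e.g. one replica map per pool rather than
    // one global map.
    class DataNodeSketch {
      private final Map<String, Map<Long, String>> replicasByPool = new HashMap<>();

      void addReplica(ExtendedBlockId id, String localFile) {
        replicasByPool
            .computeIfAbsent(id.blockPoolId, k -> new HashMap<>())
            .put(id.blockId, localFile);
      }
    }

Because the extension is confined to how blocks are named and indexed, the datanode's read and write paths themselves do not need to change.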
It seems to me that such an invasive change should be well tested
before it
is merged to trunk. Can you please tell me how this has been tested
beyond unit tests?
Risk, Quality & Testing
Besides the amount of code change, one has to ask the fundamental questions: how good is the design, and how well is the project managed?
Conceptually, federation is very simple: pools of blocks are owned by a service (a NN in this case), and the block id is extended by an identifier called the block-pool id.
First and foremost, we wrote a very extensive architecture document - more comprehensive than any other document in Hadoop in the past. This was published very early: version 1 in March 2010 and version 5 in April 2010, based on feedback we received from the community. We sought and incorporated feedback from other HDFS developers outside of Yahoo.
The project was managed as a separate branch rather than introducing the code to trunk incrementally. The branch has also been tested by us as a separate unit - this ensures that it does not destabilize trunk.
More details on testing:
The same QA process that drove and tested key stable Apache Hadoop releases (0.16, 0.17, 0.18, 0.20, 0.20-security) is being used for testing the federation feature. We have been running integrated tests with federation for a few months and continue to do so.
We will not deploy a Hadoop release with the federation feature in Yahoo clusters until we are confident that it is stable and reliable for our clusters. Indeed, the level of testing is significantly greater than for previous releases.
Hopefully the above addresses your concerns.
regards
sanjay