Hey guys,

My associates have investigated HDFS federation recently, which, turns out
to be a quite good solution for improving scalability on NameNode/DataNode
side.

However, we encountered some problem on client-side. Since:
A) For historical reason, we use clients in multiple languages to access
HDFS, (i.e. python-snakebite, or perhaps libhdfs++). So we either implement
multiple versions of ViewFS or we give up the consistency view (which can
be confusing to user).
B) We have hadoop client configuration deployed on client nodes, which we
do not have control over . Also, releasing new configuration could be a
real heavy operation because it needs to be pushed to several thousand of
nodes, as well as maintaining consistency (say a node is down throughout
the operation, then come back online. it could still possess a stale
version of configuration).

So we intended to explore another solution to these problems, and came up
with a proxy model.
That is, build a RPC proxy in front of NameNodes.
All clients talk to proxy when they need to consult NameNode, then proxy
decide which NameNode should the request go to according to mount table.
This solved our problem. All clients are seamlessly upgraded with
federation support.
We open sourced the proxy recently: https://github.com/bytedance/nnproxy
(BTW, all kinds of feedbacks are welcomed)

But there are still a few issues. For example, several modifications needs
to be done inside hadoop ipc to support rpc forwarding. We released patch
according to which with nnproxy project (
https://github.com/bytedance/nnproxy/tree/master/hadoop-patches). But it
could be better to have these merged to apache trunk. Does someone think
it's worth?


-- 
Cheers,
Tianyi HE
(+86) 185 0042 4096

Reply via email to