Hey guys, My associates have investigated HDFS federation recently, which, turns out to be a quite good solution for improving scalability on NameNode/DataNode side.
However, we encountered some problem on client-side. Since: A) For historical reason, we use clients in multiple languages to access HDFS, (i.e. python-snakebite, or perhaps libhdfs++). So we either implement multiple versions of ViewFS or we give up the consistency view (which can be confusing to user). B) We have hadoop client configuration deployed on client nodes, which we do not have control over . Also, releasing new configuration could be a real heavy operation because it needs to be pushed to several thousand of nodes, as well as maintaining consistency (say a node is down throughout the operation, then come back online. it could still possess a stale version of configuration). So we intended to explore another solution to these problems, and came up with a proxy model. That is, build a RPC proxy in front of NameNodes. All clients talk to proxy when they need to consult NameNode, then proxy decide which NameNode should the request go to according to mount table. This solved our problem. All clients are seamlessly upgraded with federation support. We open sourced the proxy recently: https://github.com/bytedance/nnproxy (BTW, all kinds of feedbacks are welcomed) But there are still a few issues. For example, several modifications needs to be done inside hadoop ipc to support rpc forwarding. We released patch according to which with nnproxy project ( https://github.com/bytedance/nnproxy/tree/master/hadoop-patches). But it could be better to have these merged to apache trunk. Does someone think it's worth? -- Cheers, Tianyi HE (+86) 185 0042 4096