Hi Hadoop community,

I am a Ph.D. student at North Carolina State University. I am modifying
Hadoop's code (touching most parts of Hadoop, e.g. the JobTracker,
TaskTracker, NameNode, and DataNode) to improve its security.

My main goal is to make Hadoop run more securely in cloud environments,
especially public clouds. To achieve that, I have redesigned the current
security mechanism to provide the following properties:

1. Bring byte-level access control to HDFS. In 0.20.204, HDFS access control
works at user or block granularity: the HDFS Delegation Token only checks
whether a given user may access a file, and the Block Token only proves which
block or blocks may be accessed. I make HDFS capable of byte-granularity
access control, so each accessing party (a user or a task process) can read
only the bytes it minimally needs.
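
To make the idea concrete, here is a minimal sketch in Java of what a
byte-range check could look like. The class and field names are hypothetical
illustrations, not my actual patch:

    public final class ByteRangeToken {
        private final String path;     // file the token applies to
        private final long startByte;  // first authorized byte (inclusive)
        private final long endByte;    // end of authorized range (exclusive)

        public ByteRangeToken(String path, long startByte, long endByte) {
            this.path = path;
            this.startByte = startByte;
            this.endByte = endByte;
        }

        // DataNode-side check: allow the read only if it lies entirely
        // inside the byte range this token authorizes.
        public boolean permits(String requestedPath, long offset, long length) {
            return path.equals(requestedPath)
                    && length >= 0
                    && offset >= startByte
                    && offset + length <= endByte;
        }
    }

A task assigned one input split would then carry a token covering only that
split's byte range, rather than a token for the whole file.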

2. I assume that in a public cloud environment, only the NameNode, secondary
NameNode, and JobTracker can be trusted. Many DataNodes and TaskTrackers may
be compromised, since some of them may run in less secure environments. I
therefore redesigned the security mechanism to minimize the damage an
attacker can do.

a. Redesign the Block Access Token to solve HDFS's widely-shared-key problem.
In the original Block Access Token design, all HDFS nodes (the NameNode and
every DataNode) share one master key to generate Block Access Tokens; if a
single DataNode is compromised, the attacker obtains that key and can forge
any Block Access Token he or she wants.
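
The general approach I take is to stop shipping the master key to DataNodes.
As a rough illustration (this helper is hypothetical, not my exact
construction), the NameNode can keep the master key private and hand each
DataNode a key derived with HMAC, so a compromised node only reveals its own
derived key:

    import javax.crypto.Mac;
    import javax.crypto.spec.SecretKeySpec;
    import java.nio.charset.StandardCharsets;

    public final class PerDataNodeKeys {
        private static final String ALG = "HmacSHA256";

        // NameNode side: derive the key handed to one specific DataNode.
        static byte[] deriveDataNodeKey(byte[] masterKey, String datanodeId)
                throws Exception {
            Mac mac = Mac.getInstance(ALG);
            mac.init(new SecretKeySpec(masterKey, ALG));
            return mac.doFinal(datanodeId.getBytes(StandardCharsets.UTF_8));
        }

        // Both sides: MAC a token's serialized fields under the
        // per-DataNode key, so the token only verifies at that DataNode.
        static byte[] signToken(byte[] dataNodeKey, byte[] tokenFields)
                throws Exception {
            Mac mac = Mac.getInstance(ALG);
            mac.init(new SecretKeySpec(dataNodeKey, ALG));
            return mac.doFinal(tokenFields);
        }
    }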

b. Redesign the HDFS Delegation Token to do fine-grained access control on
HDFS for TaskTrackers and MapReduce task processes.

In Hadoop 0.20.204, every TaskTracker can use its Kerberos credentials to
access any MapReduce file on HDFS, so it has the same privileges as the
JobTracker to read and write tokens, copy job files, and so on. If one
TaskTracker is compromised, everything critical in the MapReduce directory
(job files, Delegation Tokens) is exposed to the attacker. I solve this by
letting the JobTracker decide which TaskTracker can access which file in the
MapReduce directory on HDFS (see the sketch below).
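
A simplified sketch of that authorization decision (hypothetical classes; in
my design the grant is carried in tokens rather than a server-side map):

    import java.util.Map;
    import java.util.Set;

    public final class PerTrackerGrants {
        // trackerId -> set of HDFS paths that tracker may read, e.g.
        // "tt-17" -> { ".../job_201209_0001/job.xml",
        //              ".../job_201209_0001/job.jar" }
        private final Map<String, Set<String>> grants;

        public PerTrackerGrants(Map<String, Set<String>> grants) {
            this.grants = grants;
        }

        // Refuse any request for a file the JobTracker did not
        // explicitly grant to this TaskTracker.
        public boolean mayRead(String trackerId, String path) {
            Set<String> allowed = grants.get(trackerId);
            return allowed != null && allowed.contains(path);
        }
    }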

For a task process, once it gets an HDFS Delegation Token it can currently
access everything belonging to the job or user on HDFS. Under my design, it
can only access the bytes it needs, via the same byte-range check sketched
above.

There are some other security improvements as well. For example, a
TaskTracker can no longer learn information such as the block ID from a Block
Token (because I encrypt it), and HDFS can optionally set up a secure channel
for sending data.
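
As an illustration of hiding a token field such as the block ID, one could
encrypt it so that only the NameNode and the serving DataNode (which share
the key) can recover it, while a TaskTracker that merely forwards the token
sees opaque bytes. This sketch uses AES-GCM and is an assumption for
illustration, not my exact construction:

    import javax.crypto.Cipher;
    import javax.crypto.spec.GCMParameterSpec;
    import javax.crypto.spec.SecretKeySpec;
    import java.nio.ByteBuffer;
    import java.security.SecureRandom;

    public final class OpaqueBlockId {
        // key must be a valid AES key (16, 24, or 32 bytes)
        static byte[] encryptBlockId(byte[] key, long blockId) throws Exception {
            byte[] iv = new byte[12];
            new SecureRandom().nextBytes(iv);
            Cipher c = Cipher.getInstance("AES/GCM/NoPadding");
            c.init(Cipher.ENCRYPT_MODE, new SecretKeySpec(key, "AES"),
                   new GCMParameterSpec(128, iv));
            byte[] ct = c.doFinal(ByteBuffer.allocate(8).putLong(blockId).array());
            // prepend the IV so the decryptor can reconstruct it
            return ByteBuffer.allocate(iv.length + ct.length).put(iv).put(ct).array();
        }

        static long decryptBlockId(byte[] key, byte[] blob) throws Exception {
            ByteBuffer buf = ByteBuffer.wrap(blob);
            byte[] iv = new byte[12];
            buf.get(iv);
            byte[] ct = new byte[buf.remaining()];
            buf.get(ct);
            Cipher c = Cipher.getInstance("AES/GCM/NoPadding");
            c.init(Cipher.DECRYPT_MODE, new SecretKeySpec(key, "AES"),
                   new GCMParameterSpec(128, iv));
            return ByteBuffer.wrap(c.doFinal(ct)).getLong();
        }
    }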

With these features, Hadoop can run much more securely in untrusted
environments such as a public cloud. I have already started testing my
prototype. I would like to know whether the community is interested in this
work. Would it be valuable to contribute to production Hadoop?

I created a JIRA issue for the discussion:
https://issues.apache.org/jira/browse/HADOOP-8803#comment-13455025

Thanks,

Xianqing 
