> On 14 May 2015, at 19:34, Rahul Shrivastava <rhshr...@gmail.com> wrote:
> 
> Hi All,
> 
> I have a requirement whereby I have to extend the functionality of HDFS to
> filter out sensitive information ( SSN, Bank Account ) during data read.
> The solution has to be done at the API definition layer ( API of
> FSDataInputStream) such that it works with all our existing ETL programs.

Sanitise the data in a post-ingest MR job and give your ETL programs the path 
to the sanitised output. The sensitive data can be stored under one account; 
once the job, run as that user, has saved its output, that output can be 
copied/shared. With a pipeline like this, you know the untrusted client 
applications never have access to the raw data.
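
As a rough illustration (not anything from the Hadoop codebase), the sanitising 
job can be a single map-only pass over the ingested text. The SSN regex and the 
"XXX-XX-XXXX" replacement below are assumptions about your record format, so 
adjust them to whatever your fields actually look like:

  import java.io.IOException;
  import java.util.regex.Pattern;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.NullWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Mapper;

  // Map-only job: reads raw records, emits the same records with SSNs masked.
  public class RedactingMapper
      extends Mapper<LongWritable, Text, NullWritable, Text> {

    private static final Pattern SSN =
        Pattern.compile("\\b\\d{3}-\\d{2}-\\d{4}\\b");
    private final Text out = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      // Replace anything that looks like an SSN with a fixed marker so the
      // record structure stays predictable for the downstream ETL programs.
      out.set(SSN.matcher(value.toString()).replaceAll("XXX-XX-XXXX"));
      context.write(NullWritable.get(), out);
    }
  }

Run that as the trusted ingest user, write the output to a path the ETL 
accounts can read, and point the ETL jobs at that path.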

> I looked into FSDataInputStream or (
> https://hadoop.apache.org/docs/current/api/org/apache/hadoop/fs/FSDataInputStream.html)
> and DistributedFileSystem (
> https://hadoop.apache.org/docs/current/api/org/apache/hadoop/fs/FileSystem.html#open(org.apache.hadoop.fs.Path)
> .

> One possibility is to read the stream from FSDataInputStream and sanitize
> the stream by removing the SSN and then create a new FSDataInputStream and
> provide this new FSDataInputStream back to the client. Could someone
> provide me input as to any better way to achieve the same. I am hoping to
> build an extension to HDFS api ( FSDataInputStream) .
> 

My input: don't.

If you do try this, you have the problem of ensuring that single-byte read() 
operations on the sensitive data return some marker character, e.g. "0", which 
means you need to know where all the sensitive data is before those reads 
happen. You can't simply strip the data out without also altering the 
operations that query the length of a file, and the seek() call, so that they 
stay consistent.
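
To make that concrete, here is a minimal sketch (hypothetical names, nothing 
that exists in HDFS) of masking bytes in place rather than stripping them. It 
only works at all because masking keeps the file length and byte offsets 
unchanged, and it still assumes you already have a per-file index of sensitive 
byte ranges; wiring in seek(), getPos() and PositionedReadable so the masking 
stays lined up is exactly the part that gets hard:

  import java.io.FilterInputStream;
  import java.io.IOException;
  import java.io.InputStream;

  // Illustrative only: overwrites bytes inside known sensitive ranges with '0'
  // so that length and offsets are unchanged. Assumes a pre-computed
  // sensitiveRanges table and purely sequential reads.
  public class MaskingInputStream extends FilterInputStream {
    private final long[][] sensitiveRanges; // {start, end} offsets, hypothetical
    private long pos;                       // absolute position in the file

    public MaskingInputStream(InputStream in, long[][] sensitiveRanges) {
      super(in);
      this.sensitiveRanges = sensitiveRanges;
    }

    private boolean sensitive(long offset) {
      for (long[] r : sensitiveRanges) {
        if (offset >= r[0] && offset < r[1]) return true;
      }
      return false;
    }

    @Override
    public int read() throws IOException {
      int b = in.read();
      if (b == -1) return -1;
      int result = sensitive(pos) ? '0' : b;
      pos++;
      return result;
    }

    @Override
    public int read(byte[] buf, int off, int len) throws IOException {
      int n = in.read(buf, off, len);
      for (int i = 0; i < n; i++) {
        if (sensitive(pos + i)) buf[off + i] = '0';
      }
      if (n > 0) pos += n;
      return n;
    }
  }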

You'll also need to distinguish "sensitive files", where the data needs to be 
filtered, from normal files where the filtering doesn't apply.

But that won't be enough. If the data is being filtered client-side, then any 
program loading the original HDFS client classes will be able to bypass your 
security features. And there won't be any information collected on the server 
to indicate whether or not this has taken place: the client operations look the 
same in both cases. Which means you won't have an audit trail, and will be 
unable to state with any confidence that the sensitive data has been filtered.

Other than what I said at the beginning, you have limited options:

-attempt to do the filtering on the datanodes. DNs handling block reads don't 
know the caller's identity, only that the caller has a valid token to access a 
block, so you can't distinguish trusted from untrusted callers. And once you 
switch to at-rest encryption, the DNs don't have access to the unencrypted data 
anyway.

-encrypt the sensitive parts of the data on ingest, and don't give untrusted 
code the decryption keys. This would work provided you implement the code 
correctly and ensure the decryption keys never reach the untrusted accounts; 
that is, you've just created a key management problem. There's another 
subtlety: even with encrypted data, if the encrypted fields have the same value 
across different records, client code can correlate records to individual 
accounts & users without ever seeing the SSNs, which is the first step towards 
deanonymization.
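
If you do go down that route, here is a minimal sketch of encrypting one field 
on ingest with the standard javax.crypto API (AES-GCM; class and method names 
are illustrative). The fresh random IV per record is what stops two identical 
SSNs producing identical ciphertexts, i.e. the correlation risk above. The key 
management part (generating, storing and withholding the key from untrusted 
accounts) is exactly what this snippet does not solve.

  import javax.crypto.Cipher;
  import javax.crypto.SecretKey;
  import javax.crypto.spec.GCMParameterSpec;
  import java.nio.ByteBuffer;
  import java.nio.charset.StandardCharsets;
  import java.security.SecureRandom;

  public class FieldEncryptor {
    private final SecretKey key;            // obtained from your KMS, never hard-coded
    private final SecureRandom random = new SecureRandom();

    public FieldEncryptor(SecretKey key) {
      this.key = key;
    }

    /** Encrypts one sensitive field; returns IV + ciphertext. */
    public byte[] encryptField(String ssn) throws Exception {
      byte[] iv = new byte[12];             // fresh random IV per record
      random.nextBytes(iv);
      Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
      cipher.init(Cipher.ENCRYPT_MODE, key, new GCMParameterSpec(128, iv));
      byte[] ct = cipher.doFinal(ssn.getBytes(StandardCharsets.UTF_8));
      // Prepend the IV so a trusted reader holding the key can decrypt;
      // identical SSNs now yield different ciphertexts, so records can't be
      // correlated on the encrypted field alone.
      return ByteBuffer.allocate(iv.length + ct.length).put(iv).put(ct).array();
    }
  }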

Accordingly, leave the HDFS source alone, go write that data sanitisation job.

-Steve



