> On 14 May 2015, at 19:34, Rahul Shrivastava <rhshr...@gmail.com> wrote:
>
> Hi All,
>
> I have a requirement whereby I have to extend the functionality of HDFS to
> filter out sensitive information (SSN, Bank Account) during data read.
> The solution has to be done at the API definition layer (API of
> FSDataInputStream) such that it works with all our existing ETL programs.
Sanitise the data in some post-ingest MR operation and give your ETL programs the path to the sanitised data. The sensitive data can be stored under one account; once the job, run as that user, has saved its sanitised output, that output can be copied/shared. With a pipeline like this, you'll know that the untrusted client applications don't have access to the raw data.

> I looked into FSDataInputStream (
> https://hadoop.apache.org/docs/current/api/org/apache/hadoop/fs/FSDataInputStream.html)
> and DistributedFileSystem (
> https://hadoop.apache.org/docs/current/api/org/apache/hadoop/fs/FileSystem.html#open(org.apache.hadoop.fs.Path)
> ).
> One possibility is to read the stream from FSDataInputStream and sanitize
> the stream by removing the SSN and then create a new FSDataInputStream and
> provide this new FSDataInputStream back to the client. Could someone
> provide me input as to any better way to achieve the same. I am hoping to
> build an extension to the HDFS API (FSDataInputStream).

My input: don't.

If you do try to do this, you have the problem of ensuring that single-byte read() operations on the sensitive data return some marker character, e.g. "0", so you need to know where all the sensitive data is in advance of those reads. You can't simply strip the data out without also altering the operations that query the length of a file, and the seek() call, so that they stay consistent. You'll also need to distinguish "sensitive" files, where the data needs to be filtered, from normal files where it doesn't.

But that won't be enough. If the data is being filtered client-side, then any program loading the original HDFS client classes will be able to bypass your security features. And there won't be any information collected on the server to indicate whether or not this has taken place - the client operations will look the same in both cases. Which means you won't have an audit trail, and will be unable to state with any confidence that the sensitive data has been filtered.

Other than what I said at the beginning, you have limited options:

-attempt to do the filtering on the datanodes. DNs handling block reads don't know the caller's identity, only that the caller has a valid token to access a block, so you can't distinguish trusted from untrusted callers. And once you switch to at-rest encryption, the DNs don't have access to the unencrypted data at all.

-encrypt the sensitive parts of the data on ingest, and don't give untrusted code the decryption keys. This would work provided you implemented the code correctly and ensured that the decryption keys never got to the untrusted accounts - that is, you've just created a key management problem. There's another subtlety: even with encrypted data, if the encrypted fields have the same value across different records, client code could correlate records to individual accounts & users without ever seeing an SSN, which is the first step towards deanonymization.

Accordingly: leave the HDFS source alone and go write that data sanitisation job; a rough sketch of one follows.

-Steve
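For illustration only, a minimal sketch of that kind of sanitisation pass: a map-only MapReduce job that copies line-oriented text records from the restricted input path to a sanitised output path for the ETL programs, masking anything matching a naive SSN pattern. The class names, paths and regex here are assumptions for the example, not a definitive implementation; real field detection will depend on your record format.

import java.io.IOException;
import java.util.regex.Pattern;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Hypothetical map-only job: read raw records, mask anything that looks
// like an SSN, and write the sanitised copy to a path the ETL programs use.
public class SanitizeJob {

  public static class MaskMapper
      extends Mapper<LongWritable, Text, NullWritable, Text> {

    // Naive SSN pattern for illustration only.
    private static final Pattern SSN =
        Pattern.compile("\\b\\d{3}-\\d{2}-\\d{4}\\b");
    private final Text out = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      // Replace each SSN-shaped token with a fixed marker.
      out.set(SSN.matcher(value.toString()).replaceAll("XXX-XX-XXXX"));
      context.write(NullWritable.get(), out);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "sanitize");
    job.setJarByClass(SanitizeJob.class);
    job.setMapperClass(MaskMapper.class);
    job.setNumReduceTasks(0);                       // map-only copy
    job.setOutputKeyClass(NullWritable.class);
    job.setOutputValueClass(Text.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));    // raw, restricted data
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  // sanitised output for ETL
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

The ETL programs are then pointed at the job's output directory, while the raw input stays readable only by the ingest account.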