Thanks Uma for putting together this proposal. Overall sounds good to me, +1 for these improvements. A few comments/questions:
* If it becomes part of Apache Commons, could we make Chimera a separate JAR? We have real difficulties bumping dependency versions right now, so ideally we wouldn't need to bump our existing Commons dependencies to use Chimera.
* With this refactoring, do we have confidence that we can get our desired changes merged and released in a timely fashion? E.g. if we find another bug like HADOOP-11343, we'll first need to get the fix into Chimera, have a new Chimera release, then bump Hadoop's Chimera dependency. This also relates to the previous point: it's easier to do this dependency bump if Chimera is a separate JAR.

Best,
Andrew

On Mon, Jan 18, 2016 at 11:46 PM, Gangumalla, Uma <uma.ganguma...@intel.com> wrote:

> Hi Devs,
>
> Some of our Hadoop developers are working with the Spark community to
> implement shuffle encryption. While implementing that, they realized that
> some/most of the Hadoop encryption code and their implementation code
> would have to be duplicated. This led to the idea of creating a separate
> library, named Chimera (https://github.com/intel-hadoop/chimera). It is an
> optimized cryptographic library. It provides Java APIs at both the cipher
> level and the Java stream level to help developers implement high
> performance AES encryption/decryption with minimum code and effort.
> Chimera was originally based on the Hadoop crypto code but was improved
> and generalized a lot to support a wider scope of data encryption needs
> for more components in the community.
>
> So, the team is now thinking of making this library an open source
> project via Apache Incubation. The proposal is for Chimera to join Apache
> as an incubating project or as part of Apache Commons, to facilitate its
> adoption.
>
> In general this will bring the following advantages:
> 1. Because Chimera embeds the native library in its jar (similar to
> Snappy java), it solves the current issue in Hadoop that an HDFS client
> has to depend on libhadoop.so if the client needs to read an encryption
> zone in HDFS. This means an HDFS client may have to depend on a Hadoop
> installation on the local machine. For example, HBase depends on the HDFS
> client jar rather than a Hadoop installation and so has no access to
> libhadoop.so. So HBase cannot use an encryption zone, or it causes an
> error.
> 2. Apache Spark shuffle and spill encryption could be another example
> where we can use Chimera. We see that the stream encryption for Spark
> shuffle and spill doesn’t require a stream cipher like AES/CTR, although
> the code shares the common characteristics of a stream style API. We also
> see the need for an optimized Cipher for non-stream style use cases such
> as network encryption (e.g. RPC). These improvements can actually be
> shared by more projects in need.
>
> 3. Simplifies the code in Hadoop by using a dedicated library, and drives
> more improvements. For example, currently the Hadoop crypto code API is
> totally based on AES/CTR although it has cipher suite configurations.
>
> AES/CTR is for HDFS data encryption at rest, but it doesn’t necessarily
> need to be AES/CTR for all cases, such as data transfer encryption and
> intermediate file encryption.
>
> So, we wanted to check with the Hadoop community about this proposal.
> Please provide your feedback on it.
>
> Regards,
> Uma