Thanks Chris for the pointer and Uma for the confirmation!

I'm glad to learn about HADOOP-11127 and that there is already so much solid 
discussion in it. I will go through it, do my own investigation, and see how 
I can help with the effort.

Sure, let's get back to Chimera, and sorry for the interruption.

Regards,
Kai

-----Original Message-----
From: Gangumalla, Uma [mailto:uma.ganguma...@intel.com] 
Sent: Friday, January 22, 2016 8:38 AM
To: hdfs-dev@hadoop.apache.org
Subject: Re: Hadoop encryption module as Apache Chimera incubator project

>Uma and everyone, thank you for the proposal.  +1 to proceed.
Thanks Chris for your feedback.

Kai wrote:
I believe Haifeng had mentioned the problem in a call when we discussed the 
erasure coding work, but only now have I come to understand what the problem 
is and how Chimera or Snappy-Java solved it. It looks like there can be thin 
clients that don't rely on a Hadoop installation, so no libhadoop.so is 
available on the client host. The approach mentioned here is to bundle the 
library file (*.so) into a jar and dynamically extract it when loading. When 
no library file is contained in the jar, it falls back to the normal case, 
loading it from an installation. It's smart and nice! My question is, could 
we consider adopting this approach for the libhadoop.so library? It might be 
worth discussing because we're bundling more and more things into the library 
(recently we just put Intel ISA-L support into it), and such things may be 
desired by these clients. It may also be helpful for development, because 
sometimes when running unit tests that involve native code, errors occur 
complaining that libhadoop.so cannot be found. Thanks.
[UMA] Good points, Kai. It is worth thinking about and investing some effort 
in solving the libhadoop.so part.
 As Chris suggested, taking this discussion to JIRA HADOOP-11127 is the more 
appropriate thing to do.
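
For illustration only: a rough sketch of the extract-from-jar-with-fallback 
loading idea described above, using hypothetical class, resource, and library 
names (the real Snappy-Java and Chimera loaders handle platforms, versioning, 
and cleanup more carefully):

  import java.io.File;
  import java.io.InputStream;
  import java.nio.file.Files;
  import java.nio.file.StandardCopyOption;

  public final class NativeLoader {
    public static void load() {
      // 1. Prefer the copy bundled inside the jar, if present.
      try (InputStream in =
               NativeLoader.class.getResourceAsStream("/native/libexample.so")) {
        if (in != null) {
          File tmp = File.createTempFile("libexample", ".so");
          tmp.deleteOnExit();
          Files.copy(in, tmp.toPath(), StandardCopyOption.REPLACE_EXISTING);
          System.load(tmp.getAbsolutePath());
          return;
        }
      } catch (Exception e) {
        // Fall through to the installed library below.
      }
      // 2. Normal case: resolve the library from java.library.path,
      //    i.e. from a local installation.
      System.loadLibrary("example");
    }
  }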


Regards,
Uma


On 1/21/16, 12:18 PM, "Chris Nauroth" <cnaur...@hortonworks.com> wrote:

>> My question is, could we consider adopting this approach for the
>>libhadoop.so library?
>
>
>This is something that I have proposed already in HADOOP-11127.  There 
>is not yet consensus among the contributors in that discussion on 
>proceeding with it.  There are some big challenges around how it would 
>impact the release process.  I also have not had availability to 
>prototype an implementation to make a stronger case for feasibility.  
>Kai, if this is something that you're interested in, then I encourage 
>you to join the discussion in HADOOP-11127 or even pick up the prototyping work if you'd like.
> Since we have that existing JIRA, let's keep this mail thread focused 
>just on Chimera.  Thank you!
>
>Uma and everyone, thank you for the proposal.  +1 to proceed.
>
>--Chris Nauroth
>
>
>
>
>On 1/20/16, 11:16 PM, "Zheng, Kai" <kai.zh...@intel.com> wrote:
>
>>Thanks Uma. 
>>
>>By the way, I have a question; it's not about the Chimera project 
>>itself, but about the mentioned advantage 1 and the libhadoop.so 
>>installation problem. I've copied the relevant text below for convenience.
>>
>>>>1. As Chimera embeds the native library in its jar (similar to 
>>>>Snappy-Java), it solves the current issue in Hadoop that an HDFS 
>>>>client has to depend on libhadoop.so if it needs to read an 
>>>>encryption zone in HDFS. This means an HDFS client may have to 
>>>>depend on a Hadoop installation on the local machine. For example, 
>>>>HBase depends on the HDFS client jar rather than a Hadoop 
>>>>installation and so has no access to libhadoop.so, so HBase cannot use an 
>>>>encryption zone without causing errors.
>>
>>I believe Haifeng had mentioned the problem in a call when we 
>>discussed the erasure coding work, but only now have I come to 
>>understand what the problem is and how Chimera or Snappy-Java solved 
>>it. It looks like there can be thin clients that don't rely on a 
>>Hadoop installation, so no libhadoop.so is available on the client 
>>host. The approach mentioned here is to bundle the library file 
>>(*.so) into a jar and dynamically extract it when loading. When no 
>>library file is contained in the jar, it falls back to the normal 
>>case, loading it from an installation. It's smart and nice! My 
>>question is, could we consider adopting this approach for the 
>>libhadoop.so library? It might be worth discussing because we're 
>>bundling more and more things into the library (recently we just put 
>>Intel ISA-L support into it), and such things may be desired by these 
>>clients. It may also be helpful for development, because sometimes 
>>when running unit tests that involve native code, errors occur complaining that libhadoop.so cannot be found. 
>>Thanks.
>>
>>Regards,
>>Kai
>>
>>-----Original Message-----
>>From: Gangumalla, Uma [mailto:uma.ganguma...@intel.com]
>>Sent: Thursday, January 21, 2016 11:20 AM
>>To: hdfs-dev@hadoop.apache.org
>>Subject: Re: Hadoop encryption module as Apache Chimera incubator 
>>project
>>
>>Hi All,
>>Thanks Andrew, ATM, Yi, Kai, and Larry. Thanks, Haifeng, for 
>>clarifying the release stuff.
>>
>>Please find my responses below.
>>
>>Andrew wrote:
>>If it becomes part of Apache Commons, could we make Chimera a separate 
>>JAR? We have real difficulties bumping dependency versions right now, 
>>so ideally we don't need to bump our existing Commons dependencies to 
>>use Chimera.
>>[UMA] Yes, we plan to make it a separate jar.
>>
>>Andrew wrote:
>>With this refactoring, do we have confidence that we can get our 
>>desired changes merged and released in a timely fashion? e.g. if we 
>>find another bug like HADOOP-11343, we'll first need to get the fix 
>>into Chimera, have a new Chimera release, then bump Hadoop's Chimera 
>>dependency. This also relates to the previous point, it's easier to do 
>>this dependency bump if Chimera is a separate JAR.
>>[UMA] Yes, and the main target users for this project are Hadoop and 
>>Spark right now, so Hadoop requirements would be the priority tasks 
>>for it.
>>
>>
>>ATM wrote:
>>Uma, would you be up for approaching the Apache Commons folks saying 
>>that you'd like to contribute Chimera? I'd recommend saying that 
>>Hadoop and Spark are both on board to depend on this.
>>[UMA] Yes, will do that.
>>
>>
>>Kai wrote:
>>Just a question: does becoming a separate jar/module in Apache Commons 
>>mean Chimera or the module can be released separately and in a timely 
>>manner, without being coupled to other modules' releases in the project? Thanks.
>>
>>[Haifeng] From the Apache Commons project website 
>>(https://commons.apache.org/), we see there is already a long list of 
>>components on its Apache Commons Proper list. Each component has its 
>>own release version and date. The target is to join and become one of that list.
>>
>>Larry wrote:
>>If what we are looking for is some level of autonomy, then it would 
>>need to be a module with its own release train - or at least be able to have one.
>>
>>[UMA] Yes, agreed.
>>
>>Kai wrote:
>>So far I've seen it's mainly about AES-256. I suggest the scope be 
>>expanded a little, perhaps to a dedicated high-performance encryption 
>>library; then we would have quite a lot to contribute to it, like 
>>other ciphers, MACs, PRNGs, and so on. Then both Hadoop and Spark can 
>>benefit from it.
>>
>>[UMA] Yes, once development starts as a separate project, it is free 
>>to evolve and provide more improvements to support more customer/user 
>>encryption needs on demand.
>>Haifeng, would you add some points here?
>>
>>
>>Regards,
>>Uma
>>
>>On 1/20/16, 4:31 PM, "Andrew Wang" <andrew.w...@cloudera.com> wrote:
>>
>>>Thanks Uma for putting together this proposal. Overall sounds good to 
>>>me,
>>>+1 for these improvements. A few comments/questions:
>>>
>>>* If it becomes part of Apache Commons, could we make Chimera a 
>>>separate JAR? We have real difficulties bumping dependency versions 
>>>right now, so ideally we don't need to bump our existing Commons 
>>>dependencies to use Chimera.
>>>* With this refactoring, do we have confidence that we can get our 
>>>desired changes merged and released in a timely fashion? e.g. if we 
>>>find another bug like HADOOP-11343, we'll first need to get the fix 
>>>into Chimera, have a new Chimera release, then bump Hadoop's Chimera 
>>>dependency. This also relates to the previous point, it's easier to 
>>>do this dependency bump if Chimera is a separate JAR.
>>>
>>>Best,
>>>Andrew
>>>
>>>On Mon, Jan 18, 2016 at 11:46 PM, Gangumalla, Uma 
>>><uma.ganguma...@intel.com>
>>>wrote:
>>>
>>>> Hi Devs,
>>>>
>>>>   Some of our Hadoop developers have been working with the Spark 
>>>>community to implement shuffle encryption. While implementing it, 
>>>>they realized that much of the Hadoop encryption code and their 
>>>>implementation code would have to be duplicated. This led to the 
>>>>idea of creating a separate library, named Chimera 
>>>>(https://github.com/intel-hadoop/chimera). It is an optimized 
>>>>cryptographic library. It provides Java APIs at both the cipher 
>>>>level and the Java stream level to help developers implement high-performance 
>>>>AES encryption/decryption with minimal code and effort. Chimera 
>>>>was originally based on the Hadoop crypto code but has been improved and 
>>>>generalized considerably to support a wider scope of data encryption 
>>>>needs for more components in the community.
>>>>
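>>>> For illustration only: a minimal sketch of the stream-level idea, 
>>>>written against the standard JCE API (javax.crypto) rather than 
>>>>Chimera's own classes, whose names may differ. Any output stream can 
>>>>be wrapped so that writes come out encrypted transparently:
>>>>
>>>>  import javax.crypto.Cipher;
>>>>  import javax.crypto.CipherOutputStream;
>>>>  import javax.crypto.spec.IvParameterSpec;
>>>>  import javax.crypto.spec.SecretKeySpec;
>>>>  import java.io.ByteArrayOutputStream;
>>>>  import java.io.OutputStream;
>>>>  import java.security.SecureRandom;
>>>>
>>>>  public class CtrStreamDemo {
>>>>    public static void main(String[] args) throws Exception {
>>>>      byte[] key = new byte[32];   // AES-256 key
>>>>      byte[] iv = new byte[16];    // CTR counter block
>>>>      new SecureRandom().nextBytes(key);
>>>>      new SecureRandom().nextBytes(iv);
>>>>
>>>>      Cipher cipher = Cipher.getInstance("AES/CTR/NoPadding");
>>>>      cipher.init(Cipher.ENCRYPT_MODE,
>>>>          new SecretKeySpec(key, "AES"), new IvParameterSpec(iv));
>>>>
>>>>      ByteArrayOutputStream sink = new ByteArrayOutputStream();
>>>>      // Writes through the wrapping stream are encrypted.
>>>>      try (OutputStream out = new CipherOutputStream(sink, cipher)) {
>>>>        out.write("hello, encrypted world".getBytes("UTF-8"));
>>>>      }
>>>>      System.out.println("ciphertext bytes: " + sink.size());
>>>>    }
>>>>  }
>>>>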
>>>> So, now the team is thinking of making this library code an open 
>>>>source project via Apache incubation. The proposal is for Chimera to 
>>>>join Apache as an incubating project, or Apache Commons, to facilitate its adoption.
>>>>
>>>> In general this will bring the following advantages:
>>>> 1. As Chimera embeds the native library in its jar (similar to 
>>>>Snappy-Java), it solves the current issue in Hadoop that an HDFS 
>>>>client has to depend on libhadoop.so if it needs to read an 
>>>>encryption zone in HDFS. This means an HDFS client may have to 
>>>>depend on a Hadoop installation on the local machine. For example, 
>>>>HBase depends on the HDFS client jar rather than a Hadoop 
>>>>installation and so has no access to libhadoop.so, so HBase cannot use an 
>>>>encryption zone without causing errors.
>>>> 2. Apache Spark shuffle and spill encryption could be another 
>>>>example where we can use Chimera. We see that the stream encryption 
>>>>for Spark shuffle and spill doesn't require a stream cipher like 
>>>>AES/CTR, although the code shares the common characteristics of a 
>>>>stream-style API.
>>>> We also see the need for an optimized Cipher for non-stream-style 
>>>>use cases such as network encryption (e.g. RPC). These improvements 
>>>>can actually be shared by more projects in need.
>>>>
>>>> 3. Simplified code in Hadoop through use of a dedicated library, 
>>>> which also drives more improvements. For example, the current 
>>>> Hadoop crypto code API is entirely based on AES/CTR, although it has 
>>>> cipher suite configurations.
>>>>
>>>> AES/CTR suits HDFS data encryption at rest, but it doesn't need to 
>>>> be AES/CTR for all cases, such as data transfer encryption and 
>>>> intermediate file encryption.
>>>>
>>>>
>>>>
>>>>  So, we wanted to check with the Hadoop community about this proposal.
>>>>Please provide your feedback on it.
>>>>
>>>> Regards,
>>>> Uma
>>>>
>>
>>
>
