Thanks Haifeng. I was just waiting to see if there were any more comments. If there are no further objections, I will initiate a discussion thread in Apache Commons in a day's time and will also cc hadoop common.
Regards, Uma On 2/11/16, 6:13 PM, "Chen, Haifeng" <haifeng.c...@intel.com> wrote: >Thanks all the folks participating this discussion and providing valuable >suggestions and options. > >I suggest we take it forward to make a proposal in Apache Commons >community. > >Thanks, >Haifeng > >-----Original Message----- >From: Chen, Haifeng [mailto:haifeng.c...@intel.com] >Sent: Friday, February 5, 2016 10:06 AM >To: hdfs-dev@hadoop.apache.org; common-...@hadoop.apache.org >Subject: RE: Hadoop encryption module as Apache Chimera incubator project > >> [Chris] Yes, but even if the artifact is widely consumed, as a TLP it >>would need to sustain a community. If the scope is too narrow, then it >>will quickly fall into maintenance mode, its contributors will move on, >>and it will retire to the attic. Alone, I doubt its viability as a TLP. >>So as a first option, donating only this code to Apache Commons would >>accomplish some immediate goals in a sustainable forum. >Totally agree. As a TLP it needs nice scope and roadmap to sustain a >development community. > >Thanks, >Haifeng > >-----Original Message----- >From: Chris Douglas [mailto:cdoug...@apache.org] >Sent: Friday, February 5, 2016 6:28 AM >To: common-...@hadoop.apache.org >Cc: hdfs-dev@hadoop.apache.org >Subject: Re: Hadoop encryption module as Apache Chimera incubator project > >On Thu, Feb 4, 2016 at 12:06 PM, Gangumalla, Uma ><uma.ganguma...@intel.com> wrote: > >> [UMA] Ok. Great. You are right. I have cc'ed to hadoop common. (You >> mean to cc Apache commons as well?) > >I meant, if you start a discussion with Apache Commons, please CC >common-dev@hadoop to coordinate. > >> [UMA] Right now we plan to have encryption libraries are the only >> one's we planned and as we see lot of interest from other projects >> like spark to use them. 
I see some challenges when we bring lot of >> code(other common >> codes) into this project is that, they all would have different >> requirements and may be different expected timelines for release etc. >> Some projects may just wanted to use encryption interfaces alone but >>not all. >> As they are completely independent codes, may be better to scope out >> clearly. > >Yes, but even if the artifact is widely consumed, as a TLP it would need >to sustain a community. If the scope is too narrow, then it will quickly >fall into maintenance mode, its contributors will move on, and it will >retire to the attic. Alone, I doubt its viability as a TLP. So as a first >option, donating only this code to Apache Commons would accomplish some >immediate goals in a sustainable forum. > >APR has a similar scope. As a second option, that may also be a >reasonable home, particularly if some of the native bits could integrate >with APR. > >If the scope is broader, the effort could sustain prolonged development. >The current code is developing a strategy for packing native libraries on >multiple platforms, a capability that, say, the native compression codecs >(AFAIK) still lack. While java.nio is improving, many projects would >benefit from a better, native interface to the filesystem (e.g., >NativeIO). We could avoid duplicating effort and collaborate on a common >library. > >As a third option, Hadoop already implements some useful native >libraries, which is why a subproject might be a sound course. That would >enable the subproject to coordinate with Hadoop on migrating its native >functionality to a separable, reusable component, then move to a TLP when >we can rely on it exclusively (if it has a well-defined, independent >community). It could control its release cadence and limit its >dependencies. > >Finally, this is beside the point if nobody is interested in doing the >work on such a project. 
It's rude to pull code out of Hadoop and donate >it to another project so Spark can avoid a dependency, but this instance >seems reasonable to me. -C > >[1] https://apr.apache.org/ > >> On 2/3/16, 6:46 PM, "Chen, Haifeng" <haifeng.c...@intel.com> wrote: >> >>>Thanks Chris. >>> >>>>> I went through the repository, and now understand the reasoning >>>>>that would locate this code in Apache Commons. This isn't proposing >>>>>to extract much of the implementation and it takes none of the >>>>>integration. It's limited to interfaces to crypto libraries and >>>>>streams/configuration. >>>Exactly. >>> >>>>> Chimera would be a boutique TLP, unless we wanted to draw out more >>>>>of the integration and tooling. Is that a goal you're interested in >>>>>pursuing? There's a tension between keeping this focused and >>>>>including enough functionality to make it viable as an independent >>>>>component. >>>The Chimera goal was for providing useful, common and optimized >>>cryptographic functionalities. I would prefer that it is still focused >>>in this clear scope. Multiple domain requirements will put more >>>challenges and uncertainties in where and how it should go, thus more >>>risk in stalling. >>> >>>>> If the encryption libraries are the only ones you're interested in >>>>>pulling out, then Apache Commons does seem like a better target than >>>>>a separate project. >>>Yes. Just mentioned above, the library will be positioned in >>>cryptographic. >>> >>> >>>Thanks, >>> >>>-----Original Message----- >>>From: Chris Douglas [mailto:cdoug...@apache.org] >>>Sent: Thursday, February 4, 2016 7:26 AM >>>To: hdfs-dev@hadoop.apache.org >>>Subject: Re: Hadoop encryption module as Apache Chimera incubator >>>project >>> >>>I went through the repository, and now understand the reasoning that >>>would locate this code in Apache Commons. This isn't proposing to >>>extract much of the implementation and it takes none of the >>>integration. 
It's limited to interfaces to crypto libraries and >>>streams/configuration. It might be a reasonable fit for commons-codec, >>>but that's a pretty sparse library and driving the release cadence >>>might be more complicated. It'd be worth discussing on their lists >>>(please also CC common-dev@). >>> >>>Chimera would be a boutique TLP, unless we wanted to draw out more of >>>the integration and tooling. Is that a goal you're interested in >>>pursuing? >>>There's a tension between keeping this focused and including enough >>>functionality to make it viable as an independent component. By way of >>>example, Hadoop's common project requires too many dependencies and >>>carries too much historical baggage for other projects to rely on. >>>I agree with Colin/Steve: we don't want this to grow into another >>>guava-like dependency that creates more work in conflicts than it >>>saves in implementation... >>> >>>Would it make sense to also package some of the compression libraries, >>>and maybe some of the text processing from MapReduce? Evolving some of >>>this code to a common library with few/no dependencies would be >>>generally useful. As a subproject, it could have a broader scope that >>>could evolve into a viable TLP. If the encryption libraries are the >>>only ones you're interested in pulling out, then Apache Commons does >>>seem like a better target than a separate project. -C >>> >>> >>>On Wed, Feb 3, 2016 at 1:49 AM, Chris Douglas <cdoug...@apache.org> >>>wrote: >>>> On Wed, Feb 3, 2016 at 12:48 AM, Gangumalla, Uma >>>> <uma.ganguma...@intel.com> wrote: >>>>>>Standing in the point of shared fundamental piece of code like >>>>>>this, I do think Apache Commons might be the best direction which >>>>>>we can try as the first effort. In this direction, we still need to >>>>>>work with Apache Common community for buying in and accepting the >>>>>>proposal. >>>>> Make sense. >>>> >>>> Makes sense how? 
>>>> >>>>> For this we should define the independent release cycles for this >>>>> project and it would just place under Hadoop tree if we all >>>>> conclude with this option at the end. >>>> >>>> Yes. >>>> >>>>> [Chris] >>>>>>If Chimera is not successful as an independent project or stalls, >>>>>>Hadoop and/or Spark and/or $project will have to reabsorb it as >>>>>>maintainers. >>>>>> >>>>> I am not so strong on this point. If we assume project would be >>>>>unsuccessful, it can be unsuccessful(less maintained) even under >>>>>hadoop. >>>>> But if other projects depending on this piece then they would get >>>>>less support. Of course right now we feel this piece of code is very >>>>>important and we feel(expect) it can be successful as independent >>>>>project, irrespective of whether it as separate project outside >>>>>hadoop or inside. >>>>> So, I feel this point would not really influence to judge the >>>>>discussion. >>>> >>>> Sure; code can idle anywhere, but that wasn't the point I was after. >>>> You propose to extract code from Hadoop, but if Chimera fails then >>>> what recourse do we have among the other projects taking a >>>> dependency on it? Splitting off another project is feasible, but >>>> Chimera should be sustainable before this PMC can divest itself of >>>> responsibility for security libraries. That's a pretty low bar. >>>> >>>> Bundling the library with the jar is helpful; I've used that before. >>>> It should prefer (updated) libraries from the environment, if >>>> configured. Otherwise it's a pain (or impossible) for ops to patch >>>> security bugs. -C >>>> >>>>>>-----Original Message----- >>>>>>From: Colin P. McCabe [mailto:cmcc...@apache.org] >>>>>>Sent: Wednesday, February 3, 2016 4:56 AM >>>>>>To: hdfs-dev@hadoop.apache.org >>>>>>Subject: Re: Hadoop encryption module as Apache Chimera incubator >>>>>>project >>>>>> >>>>>>It's great to see interest in improving this functionality. 
I >>>>>>think Chimera could be successful as an Apache project. I don't >>>>>>have a strong opinion one way or the other as to whether it belongs >>>>>>as part of Hadoop or separate. >>>>>> >>>>>>I do think there will be some challenges splitting this >>>>>>functionality out into a separate jar, because of the way our >>>>>>CLASSPATH works right now. >>>>>>For example, let's say that Hadoop depends on Chimera 1.2 and Spark >>>>>>depends on Chimera 1.1. Now Spark jobs have two different versions >>>>>>fighting it out on the classpath, similar to the situation with >>>>>>Guava and other libraries. Perhaps if Chimera adopts a policy of >>>>>>strong backwards compatibility, we can just always use the latest >>>>>>jar, but it still seems likely that there will be problems. There >>>>>>are various classpath isolation ideas that could help here, but >>>>>>they are big projects in their own right and we don't have a clear >>>>>>timeline for them. If this does end up being a separate jar, we >>>>>>may need to shade it to avoid all these issues. >>>>>> >>>>>>Bundling the JNI glue code in the jar itself is an interesting >>>>>>idea, which we have talked about before for libhadoop.so. It >>>>>>doesn't really have anything to do with the question of TLP vs. >>>>>>non-TLP, of course. >>>>>>We could do that refactoring in Hadoop itself. The really >>>>>>complicated part of bundling JNI code in a jar is that you need to >>>>>>create jars for every cross product of (JVM version, openssl >>>>>>version, operating system). >>>>>>For example, you have the RHEL6 build for openJDK7 using openssl >>>>>>1.0.1e. >>>>>>If you change any one thing-- say, change openJDK7 to Oracle JDK8, >>>>>>then you might need to rebuild. And certainly using Ubuntu would >>>>>>be a rebuild. And so forth. This kind of clashes with Maven's >>>>>>philosophy of pulling prebuilt jars from the internet. >>>>>> >>>>>>Kai Zheng's question about whether we would bundle openSSL's >>>>>>libraries is a good one. 
Given the high rate of new >>>>>>vulnerabilities discovered in that library, it seems like bundling >>>>>>would require Hadoop users and vendors to update very frequently, >>>>>>much more frequently than Hadoop is traditionally updated. So >>>>>>probably we would not choose to bundle openssl. >>>>>> >>>>>>best, >>>>>>Colin >>>>>> >>>>>>On Tue, Feb 2, 2016 at 12:29 AM, Chris Douglas >>>>>><cdoug...@apache.org> >>>>>>wrote: >>>>>>> As a subproject of Hadoop, Chimera could maintain its own cadence. >>>>>>> There's also no reason why it should maintain dependencies on >>>>>>> other parts of Hadoop, if those are separable. How is this >>>>>>> solution inadequate? >>>>>>> >>>>>>> If Chimera is not successful as an independent project or stalls, >>>>>>> Hadoop and/or Spark and/or $project will have to reabsorb it as >>>>>>> maintainers. Projects have high mortality in early life, and a >>>>>>> fight over inheritance/maintenance is something we'd like to avoid. >>>>>>> If, on the other hand, it develops enough of a community where it >>>>>>> is obviously viable, then we can (and should) break it out as a >>>>>>> TLP (as we have before). If other Apache projects take a >>>>>>> dependency on Chimera, we're open to adding them to >>>>>>>security@hadoop. >>>>>>> >>>>>>> Unlike Yetus, which was largely rewritten right before it was >>>>>>> made into a TLP, security in Hadoop has a complicated pedigree. >>>>>>> If Chimera eventually becomes a TLP, it seems fair to include >>>>>>> those who work on it while it is a subproject. Declared upfront, >>>>>>> that criterion is fairer than any post hoc justification, and >>>>>>> will lead to a more accurate account of its community than a >>>>>>> subset of the Hadoop PMC/committers that volunteer. -C >>>>>>> >>>>>>> >>>>>>> On Mon, Feb 1, 2016 at 9:29 PM, Chen, Haifeng >>>>>>><haifeng.c...@intel.com> >>>>>>>wrote: >>>>>>>> Thanks to all folks providing feedbacks and participating the >>>>>>>>discussions. 
>>>>>>>> >>>>>>>> @Owen, do you still have any concerns on going forward in the >>>>>>>>direction of Apache Commons (or other options, TLP)? >>>>>>>> >>>>>>>> Thanks, >>>>>>>> Haifeng >>>>>>>> >>>>>>>> -----Original Message----- >>>>>>>> From: Chen, Haifeng [mailto:haifeng.c...@intel.com] >>>>>>>> Sent: Saturday, January 30, 2016 10:52 AM >>>>>>>> To: hdfs-dev@hadoop.apache.org >>>>>>>> Subject: RE: Hadoop encryption module as Apache Chimera >>>>>>>> incubator project >>>>>>>> >>>>>>>>>> I believe encryption is becoming a core part of Hadoop. I >>>>>>>>>>think that moving core components out of Hadoop is bad from a >>>>>>>>>>project management perspective. >>>>>>>> >>>>>>>>> Although it's certainly true that encryption capabilities (in >>>>>>>>>HDFS, YARN, etc.) are becoming core to Hadoop, I don't think >>>>>>>>>that should really influence whether or not the >>>>>>>>>non-Hadoop-specific encryption routines should be part of the >>>>>>>>>Hadoop code base, or part of the code base of another project >>>>>>>>>that Hadoop depends on. >>>>>>>>>If Chimera had existed as a library hosted at ASF when HDFS >>>>>>>>>encryption was first developed, HDFS probably would have just >>>>>>>>>added that as a dependency and been done with it. I don't think >>>>>>>>>we would've copy/pasted the code for Chimera into the Hadoop code >>>>>>>>>base. >>>>>>>> >>>>>>>> Agree with ATM. I want to also make an additional clarification. >>>>>>>>I agree that the encryption capabilities are becoming core to >>>>>>>>Hadoop. >>>>>>>>While this effort is to put common and shared encryption routines >>>>>>>>such as crypto stream implementations into a scope which can be >>>>>>>>widely shared across the Apache ecosystem. This doesn't move >>>>>>>>Hadoop encryption out of Hadoop (that is not possible). 
>>>>>>>> >>>>>>>> Agree if we make it a separate and independent releases project >>>>>>>>in Hadoop takes a step further than the existing approach and >>>>>>>>solve some issues (such as libhadoop.so problem). Frankly >>>>>>>>speaking, I think it is not the best option we can try. I also >>>>>>>>expect that an independent release project within Hadoop core >>>>>>>>will also complicate the existing release ideology of Hadoop >>>>>>>>release. >>>>>>>> >>>>>>>> Thanks, >>>>>>>> Haifeng >>>>>>>> >>>>>>>> -----Original Message----- >>>>>>>> From: Aaron T. Myers [mailto:a...@cloudera.com] >>>>>>>> Sent: Friday, January 29, 2016 9:51 AM >>>>>>>> To: hdfs-dev@hadoop.apache.org >>>>>>>> Subject: Re: Hadoop encryption module as Apache Chimera >>>>>>>> incubator project >>>>>>>> >>>>>>>> On Wed, Jan 27, 2016 at 11:31 AM, Owen O'Malley >>>>>>>><omal...@apache.org> >>>>>>>>wrote: >>>>>>>> >>>>>>>>> I believe encryption is becoming a core part of Hadoop. I think >>>>>>>>>that moving core components out of Hadoop is bad from a project >>>>>>>>>management perspective. >>>>>>>>> >>>>>>>> >>>>>>>> Although it's certainly true that encryption capabilities (in >>>>>>>>HDFS, YARN, >>>>>>>> etc.) are becoming core to Hadoop, I don't think that should >>>>>>>>really influence whether or not the non-Hadoop-specific >>>>>>>>encryption routines should be part of the Hadoop code base, or >>>>>>>>part of the code base of another project that Hadoop depends on. >>>>>>>>If Chimera had existed as a library hosted at ASF when HDFS >>>>>>>>encryption was first developed, HDFS probably would have just >>>>>>>>added that as a dependency and been done with it. I don't think >>>>>>>>we would've copy/pasted the code for Chimera into the Hadoop code >>>>>>>>base. >>>>>>>> >>>>>>>> >>>>>>>>> To put it another way, a bug in the encryption routines will >>>>>>>>>likely become a security problem that security@hadoop needs to >>>>>>>>>hear about. 
>>>>>>>>> >>>>>>>> I don't think >>>>>>>>> adding a separate project in the middle of that communication >>>>>>>>>chain is a good idea. The same applies to data corruption >>>>>>>>>problems, and so on... >>>>>>>>> >>>>>>>> >>>>>>>> Isn't the same true of all the libraries that Hadoop currently >>>>>>>>depends upon? If the commons-httpclient library (or >>>>>>>>commons-codec, or commons-io, or guava, or...) has a security >>>>>>>>vulnerability, we need to know about it so that we can update our >>>>>>>>dependency to a fixed version. >>>>>>>>This case doesn't seem materially different than that. >>>>>>>> >>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> > It may be good to keep at generalized place(As in the >>>>>>>>> > discussion, we thought that place could be Apache Commons). >>>>>>>>> >>>>>>>>> >>>>>>>>> Apache Commons is a collection of *Java* projects, so Chimera >>>>>>>>> as a JNI-based library isn't a natural fit. >>>>>>>>> >>>>>>>> >>>>>>>> Could very well be that Apache Commons's charter would preclude >>>>>>>>Chimera. >>>>>>>> You probably know better than I do about that. >>>>>>>> >>>>>>>> >>>>>>>>> Furthermore, Apache Commons doesn't have its own security list >>>>>>>>> so problems will go to the generic secur...@apache.org. >>>>>>>>> >>>>>>>> >>>>>>>> That seems easy enough to remedy, if they wanted to, and besides >>>>>>>>I'm not sure why that would influence this discussion. In my >>>>>>>>experience projects that don't have a separate >>>>>>>>security@project.a.o mailing list tend to just handle security >>>>>>>>issues on their private@project.a.o mailing list, which seems fine >>>>>>>>to me. >>>>>>>> >>>>>>>> >>>>>>>>> >>>>>>>>> Why do you think that Apache Commons is a better home than >>>>>>>>>Hadoop? >>>>>>>>> >>>>>>>> >>>>>>>> I'm certainly not at all wedded to Apache Commons, that just >>>>>>>>seemed like a natural place to put it to me. Could be that a >>>>>>>>brand new TLP might make more sense. 
>>>>>>>> >>>>>>>> I *do* think that if other non-Hadoop projects want to make use >>>>>>>>of Chimera, which as I understand it is the goal which started >>>>>>>>this thread, then Chimera should exist outside of Hadoop so that: >>>>>>>> >>>>>>>> a) Projects that have nothing to do with Hadoop can just depend >>>>>>>>directly on Chimera, which has nothing Hadoop-specific in there. >>>>>>>> >>>>>>>> b) The Hadoop project doesn't have to export/maintain/concern >>>>>>>>itself with yet another publicly-consumed interface. >>>>>>>> >>>>>>>> c) Chimera can have its own (presumably much faster) release >>>>>>>>cadence completely separate from Hadoop. >>>>>>>> >>>>>>>> -- >>>>>>>> Aaron T. Myers >>>>>>>> Software Engineer, Cloudera >>>>> >>