>> Let's do one step at a time. There is a clear need for common encryption,
>> and let's focus on making that happen.

Strongly agree.
-----Original Message-----
From: Reynold Xin [mailto:r...@databricks.com]
Sent: Thursday, February 4, 2016 8:50 AM
To: hdfs-dev@hadoop.apache.org
Subject: Re: Hadoop encryption module as Apache Chimera incubator project

Let's do one step at a time. There is a clear need for common encryption, and let's focus on making that happen.

On Wed, Feb 3, 2016 at 4:48 PM, Zheng, Kai <kai.zh...@intel.com> wrote:
> I thought this discussion would switch to common-dev@ now?
>
> >> Would it make sense to also package some of the compression libraries,
> >> and maybe some of the text processing from MapReduce? Evolving some of
> >> this code to a common library with few/no dependencies would be
> >> generally useful. As a subproject, it could have a broader scope that
> >> could evolve into a viable TLP.
>
> Sounds like a great idea; it would make the potential TLP make more sense!
> I thought it could be organized like Apache Commons: the security,
> compression, and other common text-related things could live in
> different independent modules. Perhaps Hadoop conf could also be
> considered. These modules could rely on a common utility module. It
> could still be Hadoop-backed or powered, and eventually we would
> have a good place for some Hadoop common code to move into, to benefit
> and impact an even broader scope than Hadoop itself.
>
> Regards,
> Kai
>
> -----Original Message-----
> From: Chris Douglas [mailto:cdoug...@apache.org]
> Sent: Thursday, February 04, 2016 7:26 AM
> To: hdfs-dev@hadoop.apache.org
> Subject: Re: Hadoop encryption module as Apache Chimera incubator
> project
>
> I went through the repository, and now understand the reasoning that
> would locate this code in Apache Commons. This isn't proposing to
> extract much of the implementation, and it takes none of the
> integration. It's limited to interfaces to crypto libraries and
> streams/configuration.
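[The "interfaces to crypto libraries and streams" under discussion can be illustrated, in spirit, with the JDK's own javax.crypto stream wrappers. This is only a sketch of the concept using standard APIs, not Chimera's actual interface; per the thread, Chimera backs a similar stream idea with a JNI-based implementation.]

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.InputStream;
import java.security.SecureRandom;
import javax.crypto.Cipher;
import javax.crypto.CipherInputStream;
import javax.crypto.spec.IvParameterSpec;
import javax.crypto.spec.SecretKeySpec;

public class CtrStreamDemo {

    // Read a stream fully into a byte array.
    static byte[] readAll(InputStream in) throws Exception {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        for (int n; (n = in.read(buf)) != -1; ) {
            out.write(buf, 0, n);
        }
        return out.toByteArray();
    }

    static Cipher ctrCipher(int mode, byte[] key, byte[] iv) throws Exception {
        Cipher c = Cipher.getInstance("AES/CTR/NoPadding");
        c.init(mode, new SecretKeySpec(key, "AES"), new IvParameterSpec(iv));
        return c;
    }

    public static void main(String[] args) throws Exception {
        byte[] key = new byte[16];
        byte[] iv = new byte[16];
        SecureRandom rng = new SecureRandom();
        rng.nextBytes(key);
        rng.nextBytes(iv);

        byte[] plaintext = "stream encryption demo".getBytes("UTF-8");

        // Encrypt by wrapping a plain stream in a CipherInputStream.
        byte[] ciphertext = readAll(new CipherInputStream(
                new ByteArrayInputStream(plaintext),
                ctrCipher(Cipher.ENCRYPT_MODE, key, iv)));

        // Decrypt with the same key/IV; CTR is a stream mode, so no
        // padding is needed and ciphertext length equals plaintext length.
        byte[] roundtrip = readAll(new CipherInputStream(
                new ByteArrayInputStream(ciphertext),
                ctrCipher(Cipher.DECRYPT_MODE, key, iv)));

        System.out.println(new String(roundtrip, "UTF-8"));
    }
}
```

The appeal of the stream-wrapper shape is exactly why it extracts cleanly: callers see only InputStream/OutputStream plus key/IV configuration, so the cipher backend (JCE here, openssl via JNI in Chimera) can be swapped without touching HDFS or Spark code.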
> It might be a reasonable fit for commons-codec,
> but that's a pretty sparse library, and driving the release cadence
> might be more complicated. It'd be worth discussing on their lists (please
> also CC common-dev@).
>
> Chimera would be a boutique TLP, unless we wanted to draw out more of
> the integration and tooling. Is that a goal you're interested in pursuing?
> There's a tension between keeping this focused and including enough
> functionality to make it viable as an independent component. By way of
> example, Hadoop's common project requires too many dependencies and
> carries too much historical baggage for other projects to rely on.
> I agree with Colin/Steve: we don't want this to grow into another
> guava-like dependency that creates more work in conflicts than it
> saves in implementation...
>
> Would it make sense to also package some of the compression libraries,
> and maybe some of the text processing from MapReduce? Evolving some of
> this code to a common library with few/no dependencies would be
> generally useful. As a subproject, it could have a broader scope that
> could evolve into a viable TLP. If the encryption libraries are the
> only ones you're interested in pulling out, then Apache Commons does
> seem like a better target than a separate project. -C
>
>
> On Wed, Feb 3, 2016 at 1:49 AM, Chris Douglas <cdoug...@apache.org> wrote:
> > On Wed, Feb 3, 2016 at 12:48 AM, Gangumalla, Uma
> > <uma.ganguma...@intel.com> wrote:
> >>> Looking at this as a shared, fundamental piece of code, I do think
> >>> Apache Commons might be the best direction we can try as a first
> >>> effort. In that direction, we still need to work with the Apache
> >>> Commons community on buying in and accepting the proposal.
> >> Make sense.
> >
> > Makes sense how?
> >
> >> For this we should define independent release cycles for this
> >> project, and it would just be placed under the Hadoop tree if we all
> >> conclude with this option at the end.
> >
> > Yes.
> >
> >> [Chris]
> >>> If Chimera is not successful as an independent project or stalls,
> >>> Hadoop and/or Spark and/or $project will have to reabsorb it as
> >>> maintainers.
> >>>
> >> I am not so strong on this point. If we assume the project would be
> >> unsuccessful, it can be unsuccessful (less maintained) even under Hadoop.
> >> But if other projects depend on this piece, then they would get
> >> less support. Of course, right now we feel this piece of code is
> >> very important, and we feel (expect) it can be successful as an
> >> independent project, irrespective of whether it is a separate project
> >> outside Hadoop or inside.
> >> So I feel this point would not really influence the discussion.
> >
> > Sure; code can idle anywhere, but that wasn't the point I was after.
> > You propose to extract code from Hadoop, but if Chimera fails, then
> > what recourse do we have among the other projects taking a
> > dependency on it? Splitting off another project is feasible, but
> > Chimera should be sustainable before this PMC can divest itself of
> > responsibility for security libraries. That's a pretty low bar.
> >
> > Bundling the library with the jar is helpful; I've used that before.
> > It should prefer (updated) libraries from the environment, if
> > configured. Otherwise it's a pain (or impossible) for ops to patch
> > security bugs. -C
> >
> >>> -----Original Message-----
> >>> From: Colin P. McCabe [mailto:cmcc...@apache.org]
> >>> Sent: Wednesday, February 3, 2016 4:56 AM
> >>> To: hdfs-dev@hadoop.apache.org
> >>> Subject: Re: Hadoop encryption module as Apache Chimera incubator
> >>> project
> >>>
> >>> It's great to see interest in improving this functionality. I
> >>> think Chimera could be successful as an Apache project. I don't
> >>> have a strong opinion one way or the other as to whether it belongs
> >>> as part of Hadoop or separate.
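[The load order Chris describes, preferring an (updatable) library from the environment and only falling back to a copy bundled in the jar, can be sketched as below. This is an illustrative sketch, not Hadoop's or Chimera's actual loader; the library and resource names are made up.]

```java
import java.io.File;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.StandardCopyOption;

public class NativeLoader {

    /**
     * Try the system-installed library first (so ops can patch security
     * bugs in place), then fall back to a copy bundled inside the jar.
     * Returns which source was used, or "unavailable" if neither worked.
     */
    static String load(String name, String bundledResource) {
        try {
            // Resolves against java.library.path, e.g. /usr/lib64/lib<name>.so
            System.loadLibrary(name);
            return "system";
        } catch (UnsatisfiedLinkError systemMiss) {
            try (InputStream in = NativeLoader.class.getResourceAsStream(bundledResource)) {
                if (in == null) {
                    return "unavailable"; // nothing bundled either
                }
                // Native code can't be loaded from inside a jar directly:
                // extract to a temp file, then load by absolute path.
                File tmp = File.createTempFile(name, ".so");
                tmp.deleteOnExit();
                Files.copy(in, tmp.toPath(), StandardCopyOption.REPLACE_EXISTING);
                System.load(tmp.getAbsolutePath());
                return "bundled";
            } catch (Exception | UnsatisfiedLinkError bundledMiss) {
                return "unavailable";
            }
        }
    }

    public static void main(String[] args) {
        // "nosuchlib" is a made-up name: with no system copy and no bundled
        // resource, the loader reports "unavailable".
        System.out.println(load("nosuchlib", "/native/libnosuchlib.so"));
    }
}
```

Putting the environment lookup first is the point of Chris's remark: a bundled copy is a convenience default, not something operators should have to rebuild a jar to patch.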
> >>>
> >>> I do think there will be some challenges splitting this
> >>> functionality out into a separate jar, because of the way our
> >>> CLASSPATH works right now.
> >>> For example, let's say that Hadoop depends on Chimera 1.2 and Spark
> >>> depends on Chimera 1.1. Now Spark jobs have two different versions
> >>> fighting it out on the classpath, similar to the situation with
> >>> Guava and other libraries. Perhaps if Chimera adopts a policy of
> >>> strong backwards compatibility, we can just always use the latest
> >>> jar, but it still seems likely that there will be problems. There
> >>> are various classpath isolation ideas that could help here, but
> >>> they are big projects in their own right and we don't have a clear
> >>> timeline for them. If this does end up being a separate jar, we
> >>> may need to shade it to avoid all these issues.
> >>>
> >>> Bundling the JNI glue code in the jar itself is an interesting
> >>> idea, which we have talked about before for libhadoop.so. It
> >>> doesn't really have anything to do with the question of TLP vs.
> >>> non-TLP, of course.
> >>> We could do that refactoring in Hadoop itself. The really
> >>> complicated part of bundling JNI code in a jar is that you need to
> >>> create jars for every element of the cross product of (JVM version,
> >>> openssl version, operating system).
> >>> For example, you have the RHEL6 build for openJDK7 using openssl 1.0.1e.
> >>> If you change any one thing -- say, changing openJDK7 to Oracle JDK8 --
> >>> then you might need to rebuild. And certainly using Ubuntu would
> >>> require a rebuild. And so forth. This kind of clashes with Maven's
> >>> philosophy of pulling prebuilt jars from the internet.
> >>>
> >>> Kai Zheng's question about whether we would bundle openSSL's
> >>> libraries is a good one.
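[When two versions of a jar "fight it out on the classpath" as Colin describes, a common first diagnostic is to ask the JVM where a class was actually loaded from. A minimal sketch; the class names here are just placeholders:]

```java
import java.security.CodeSource;

public class WhichJar {

    // Report where a class was loaded from; with two versions of a library
    // on the classpath, this shows which jar actually won for a given class.
    static String origin(Class<?> cls) {
        CodeSource src = cls.getProtectionDomain().getCodeSource();
        return src == null ? "(jdk runtime image)" : src.getLocation().toString();
    }

    public static void main(String[] args) {
        System.out.println("WhichJar loaded from: " + origin(WhichJar.class));
        // Bootstrap classes like String have no code source.
        System.out.println("String loaded from: " + origin(String.class));
    }
}
```

Shading sidesteps the problem entirely by relocating the dependency's packages into the consumer's own namespace, at the cost of larger jars and duplicated classes, which is why Colin frames it as a fallback rather than a first choice.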
> >>> Given the high rate of new
> >>> vulnerabilities discovered in that library, it seems like bundling
> >>> would require Hadoop users and vendors to update very frequently,
> >>> much more frequently than Hadoop is traditionally updated. So we
> >>> probably would not choose to bundle openssl.
> >>>
> >>> best,
> >>> Colin
> >>>
> >>> On Tue, Feb 2, 2016 at 12:29 AM, Chris Douglas <cdoug...@apache.org>
> >>> wrote:
> >>>> As a subproject of Hadoop, Chimera could maintain its own cadence.
> >>>> There's also no reason why it should maintain dependencies on
> >>>> other parts of Hadoop, if those are separable. How is this
> >>>> solution inadequate?
> >>>>
> >>>> If Chimera is not successful as an independent project or stalls,
> >>>> Hadoop and/or Spark and/or $project will have to reabsorb it as
> >>>> maintainers. Projects have high mortality in early life, and a
> >>>> fight over inheritance/maintenance is something we'd like to avoid.
> >>>> If, on the other hand, it develops enough of a community where it
> >>>> is obviously viable, then we can (and should) break it out as a
> >>>> TLP (as we have before). If other Apache projects take a
> >>>> dependency on Chimera, we're open to adding them to security@hadoop.
> >>>>
> >>>> Unlike Yetus, which was largely rewritten right before it was
> >>>> made into a TLP, security in Hadoop has a complicated pedigree.
> >>>> If Chimera eventually becomes a TLP, it seems fair to include
> >>>> those who work on it while it is a subproject. Declared upfront,
> >>>> that criterion is fairer than any post hoc justification, and
> >>>> will lead to a more accurate account of its community than a
> >>>> subset of the Hadoop PMC/committers that volunteer. -C
> >>>>
> >>>>
> >>>> On Mon, Feb 1, 2016 at 9:29 PM, Chen, Haifeng <haifeng.c...@intel.com>
> >>>> wrote:
> >>>>> Thanks to all the folks providing feedback and participating in
> >>>>> the discussions.
> >>>>>
> >>>>> @Owen, do you still have any concerns about going forward in the
> >>>>> direction of Apache Commons (or other options, such as a TLP)?
> >>>>>
> >>>>> Thanks,
> >>>>> Haifeng
> >>>>>
> >>>>> -----Original Message-----
> >>>>> From: Chen, Haifeng [mailto:haifeng.c...@intel.com]
> >>>>> Sent: Saturday, January 30, 2016 10:52 AM
> >>>>> To: hdfs-dev@hadoop.apache.org
> >>>>> Subject: RE: Hadoop encryption module as Apache Chimera
> >>>>> incubator project
> >>>>>
> >>>>>>> I believe encryption is becoming a core part of Hadoop. I
> >>>>>>> think that moving core components out of Hadoop is bad from a
> >>>>>>> project management perspective.
> >>>>>
> >>>>>> Although it's certainly true that encryption capabilities (in
> >>>>>> HDFS, YARN, etc.) are becoming core to Hadoop, I don't think
> >>>>>> that should really influence whether or not the
> >>>>>> non-Hadoop-specific encryption routines should be part of the
> >>>>>> Hadoop code base, or part of the code base of another project
> >>>>>> that Hadoop depends on.
> >>>>>> If Chimera had existed as a library hosted at the ASF when HDFS
> >>>>>> encryption was first developed, HDFS probably would have just
> >>>>>> added that as a dependency and been done with it. I don't think
> >>>>>> we would've copy/pasted the code for Chimera into the Hadoop code base.
> >>>>>
> >>>>> Agree with ATM. I want to also make an additional clarification.
> >>>>> I agree that the encryption capabilities are becoming core to Hadoop.
> >>>>> But this effort is to put common, shared encryption routines,
> >>>>> such as the crypto stream implementations, into a scope where
> >>>>> they can be widely shared across the Apache ecosystem. This
> >>>>> doesn't move Hadoop encryption out of Hadoop (that is not possible).
> >>>>>
> >>>>> I agree that making it a separate, independently released project
> >>>>> within Hadoop takes a step beyond the existing approach and
> >>>>> solves some issues (such as the libhadoop.so problem).
> >>>>> Frankly speaking, though, I don't think it is the best option we
> >>>>> can try. I also expect that an independently released project
> >>>>> within Hadoop core would complicate the existing release model
> >>>>> of Hadoop.
> >>>>>
> >>>>> Thanks,
> >>>>> Haifeng
> >>>>>
> >>>>> -----Original Message-----
> >>>>> From: Aaron T. Myers [mailto:a...@cloudera.com]
> >>>>> Sent: Friday, January 29, 2016 9:51 AM
> >>>>> To: hdfs-dev@hadoop.apache.org
> >>>>> Subject: Re: Hadoop encryption module as Apache Chimera
> >>>>> incubator project
> >>>>>
> >>>>> On Wed, Jan 27, 2016 at 11:31 AM, Owen O'Malley <omal...@apache.org>
> >>>>> wrote:
> >>>>>
> >>>>>> I believe encryption is becoming a core part of Hadoop. I think
> >>>>>> that moving core components out of Hadoop is bad from a project
> >>>>>> management perspective.
> >>>>>>
> >>>>>
> >>>>> Although it's certainly true that encryption capabilities (in
> >>>>> HDFS, YARN,
> >>>>> etc.) are becoming core to Hadoop, I don't think that should
> >>>>> really influence whether or not the non-Hadoop-specific
> >>>>> encryption routines should be part of the Hadoop code base, or
> >>>>> part of the code base of another project that Hadoop depends on.
> >>>>> If Chimera had existed as a library hosted at the ASF when HDFS
> >>>>> encryption was first developed, HDFS probably would have just
> >>>>> added that as a dependency and been done with it. I don't think
> >>>>> we would've copy/pasted the code for Chimera into the Hadoop code base.
> >>>>>
> >>>>>
> >>>>>> To put it another way, a bug in the encryption routines will
> >>>>>> likely become a security problem that security@hadoop needs to
> >>>>>> hear about. I don't think
> >>>>>> adding a separate project in the middle of that communication
> >>>>>> chain is a good idea. The same applies to data corruption
> >>>>>> problems, and so on...
> >>>>>>
> >>>>>
> >>>>> Isn't the same true of all the libraries that Hadoop currently
> >>>>> depends upon?
> >>>>> If the commons-httpclient library (or
> >>>>> commons-codec, or commons-io, or guava, or...) has a security
> >>>>> vulnerability, we need to know about it so that we can update our
> >>>>> dependency to a fixed version.
> >>>>> This case doesn't seem materially different from that.
> >>>>>
> >>>>>
> >>>>>>
> >>>>>> > It may be good to keep it in a generalized place (as in the
> >>>>>> > discussion, we thought that place could be Apache Commons).
> >>>>>>
> >>>>>> Apache Commons is a collection of *Java* projects, so Chimera
> >>>>>> as a JNI-based library isn't a natural fit.
> >>>>>>
> >>>>>
> >>>>> Could very well be that Apache Commons's charter would preclude
> >>>>> Chimera.
> >>>>> You probably know better than I do about that.
> >>>>>
> >>>>>
> >>>>>> Furthermore, Apache Commons doesn't have its own security list,
> >>>>>> so problems will go to the generic secur...@apache.org.
> >>>>>>
> >>>>>
> >>>>> That seems easy enough to remedy, if they wanted to, and besides,
> >>>>> I'm not sure why that would influence this discussion. In my
> >>>>> experience, projects that don't have a separate
> >>>>> security@project.a.o mailing list tend to just handle security
> >>>>> issues on their private@project.a.o mailing list, which seems
> >>>>> fine to me.
> >>>>>
> >>>>>
> >>>>>>
> >>>>>> Why do you think that Apache Commons is a better home than Hadoop?
> >>>>>>
> >>>>>
> >>>>> I'm certainly not at all wedded to Apache Commons; that just
> >>>>> seemed like a natural place to put it to me. Could be that a
> >>>>> brand new TLP might make more sense.
> >>>>>
> >>>>> I *do* think that if other non-Hadoop projects want to make use
> >>>>> of Chimera, which as I understand it is the goal that started
> >>>>> this thread, then Chimera should exist outside of Hadoop so that:
> >>>>>
> >>>>> a) Projects that have nothing to do with Hadoop can just depend
> >>>>> directly on Chimera, which has nothing Hadoop-specific in it.
> >>>>>
> >>>>> b) The Hadoop project doesn't have to export/maintain/concern
> >>>>> itself with yet another publicly consumed interface.
> >>>>>
> >>>>> c) Chimera can have its own (presumably much faster) release
> >>>>> cadence, completely separate from Hadoop.
> >>>>>
> >>>>> --
> >>>>> Aaron T. Myers
> >>>>> Software Engineer, Cloudera