On Sat, Mar 1, 2014 at 2:05 AM, Patrick Wendell <pwend...@gmail.com> wrote:
> Hey,
>
> Thanks everyone for chiming in on this. I wanted to summarize these issues a bit, particularly wrt the constituents involved - does this seem accurate?
>
> = Spark Users =
> In general, those linking against Spark should be totally unaffected by the build choice. Spark will continue to publish well-formed poms and jars to Maven Central. This is a no-op wrt this decision.
>
> = Spark Developers =
> There are two concerns: (a) general day-to-day development and packaging, and (b) Spark binaries and packages for distribution.
>
> For (a) - sbt seems better because it's just nicer for doing Scala development (incremental compilation is simple, we have some home-baked tools for compiling Spark vs. the Spark deps, etc.). The argument that Maven has more "general know-how" hasn't, at least so far, affected us in the ~2 years we've maintained both builds - adding stuff for Maven is typically just as annoying/difficult as with sbt.
>
> For (b) - Some non-specific concerns were raised about bugs with the sbt assembly package - we should look into this and see what is going on. Maven has better out-of-the-box support for publishing to Maven Central; we'd have to do some manual work on our end to make this work well with sbt.
These are not non-specific concerns - assembly via sbt is fragile; the (manual) exclusion rules in the sbt project are testament to this. In particular, I don't see any quantifiable benefits in using sbt over Maven. Incremental compilation, compiling only a subproject, running specific tests, etc. are all available even with Maven - so they are not differentiators. On the other hand, sbt does introduce further manual overhead in dependency management for assembled/shaded jar creation.

Regards,
Mridul

>
> = Downstream Integrators =
> On this one it seems that Maven is the universal favorite, largely because of community awareness of Maven and comfort with Maven builds. Some things, like restructuring the Spark build to inherit config values from a vendor build, will not be possible with sbt (though fairly straightforward to work around). Other cases where vendors have directly modified or inherited the Spark build won't work anymore if we standardize on sbt. These have no obvious workaround at this point, as far as I can see.
>
> - Patrick
>
> On Wed, Feb 26, 2014 at 7:09 PM, Mridul Muralidharan <mri...@gmail.com> wrote:
>> On Feb 26, 2014 11:12 PM, "Patrick Wendell" <pwend...@gmail.com> wrote:
>>>
>>> @mridul - As far as I know, both Maven and sbt use fairly similar processes for building the assembly/uber jar. We actually used to package Spark with sbt, there were no specific issues we encountered, and AFAIK sbt respects versioning of transitive dependencies correctly. Do you have a specific bug listing for sbt that indicates something is broken?
>>
>> Slightly longish ...
>>
>> The assembled jar generated via sbt broke all over the place while I was adding YARN support in 0.6, and I had to fix the sbt project a fair bit to get it to work: we need the assembled jar to submit a YARN job.
>>
>> When I finally submitted those changes to 0.7, it broke even more - since dependencies changed: someone else had thankfully already added Maven support by then, which worked remarkably well out of the box (with some minor tweaks)!
>>
>> In theory, they might be expected to work the same, but practically they did not: as I mentioned, it must just have been luck that Maven worked that well; but given multiple past nasty experiences with sbt, and the fact that it does not bring anything compelling or new in contrast, I am fairly against the idea of using only sbt - in spite of Maven being unintuitive at times.
>>
>> Regards,
>> Mridul
>>
>>>
>>> @sandy - It sounds like you are saying that the CDH build would be easier with Maven because you can inherit the POM. However, is this just a matter of convenience for packagers, or would standardizing on sbt limit capabilities in some way? I assume that it would just mean a bit more manual work for packagers having to figure out how to set the Hadoop version in sbt and exclude certain dependencies. For instance, what does CDH do about other components like Impala that are not based on Maven at all?
>>>
>>> On Wed, Feb 26, 2014 at 9:31 AM, Evan Chan <e...@ooyala.com> wrote:
>>> > I'd like to propose the following way to move forward, based on the comments I've seen:
>>> >
>>> > 1. Aggressively clean up the giant dependency graph. One ticket I might work on if I have time is SPARK-681, which might remove the giant fastutil dependency (~15MB by itself).
>>> >
>>> > 2. Take an intermediate step by having only ONE source of truth w.r.t. dependencies and versions.
>>> > This means either:
>>> >   a) Using a Maven POM as the spec for dependencies, Hadoop version, etc., and then using sbt-pom-reader to import it.
>>> >   b) Using the build.scala as the spec, and "sbt make-pom" to generate the pom.xml for the dependencies.
>>> >
>>> > The idea is to remove the pain and errors associated with manual translation of dependency specs from one system to another, while still maintaining the things which are hard to translate (plugins).
>>> >
>>> > On Wed, Feb 26, 2014 at 7:17 AM, Koert Kuipers <ko...@tresata.com> wrote:
>>> >> We maintain an in-house Spark build using sbt. We have no problem using sbt assembly. We did add a few exclude statements for transitive dependencies.
>>> >>
>>> >> The main enemies of assemblies are jars that include stuff they shouldn't (kryo comes to mind, I think they include logback?), new versions of jars that change the provider/artifact without changing the package (asm), and incompatible new releases (protobuf). These break the transitive resolution process. I imagine that's true for any build tool.
>>> >>
>>> >> Besides shading, I don't see anything Maven can do that sbt cannot, and if I understand it correctly, shading is not currently done using the build tool.
>>> >>
>>> >> Since Spark is primarily Scala/Akka based, the main developer base will be familiar with sbt (I think?). Switching build tools is always painful. I personally think it is smarter to put this burden on a limited number of upstream integrators than on the community. That said, I don't think it's a problem for us to maintain an sbt build in-house if Spark switched to Maven.
>>> >>
>>> >> The problem is, the complete Spark dependency graph is fairly large, and there are a lot of conflicting versions in there - in particular when we bump versions of dependencies - making managing this messy at best.
>>> >>
>>> >> Now, I have not looked in detail at how Maven manages this - it might just be accidental that we get a decent out-of-the-box assembled shaded jar (since we don't do anything special to configure it).
>>> >> With the current state of sbt in Spark, it definitely is not a good solution: if we can enhance it (or it already is?), while keeping the management of the version/dependency graph manageable, I don't have any objections to using sbt or Maven!
>>> >> Too many excludes, pinned versions, etc. would just make things unmanageable in the future.
>>> >>
>>> >> Regards,
>>> >> Mridul
>>> >>
>>> >> On Wed, Feb 26, 2014 at 8:56 AM, Evan Chan <e...@ooyala.com> wrote:
>>> >>> Actually, you can control exactly how sbt assembly merges or resolves conflicts. I believe the default settings, however, lead to an ordering which cannot be controlled.
>>> >>>
>>> >>> I do wish for a smarter fat jar plugin.
>>> >>>
>>> >>> -Evan
>>> >>> To be free is not merely to cast off one's chains, but to live in a way that respects & enhances the freedom of others. (#NelsonMandela)
>>> >>>
>>> >>>> On Feb 25, 2014, at 6:50 PM, Mridul Muralidharan <mri...@gmail.com> wrote:
>>> >>>>
>>> >>>>> On Wed, Feb 26, 2014 at 5:31 AM, Patrick Wendell <pwend...@gmail.com> wrote:
>>> >>>>> Evan - this is a good thing to bring up.
>>> >>>>> Wrt the shader plug-in - right now we don't actually use it for bytecode shading; we simply use it for creating the uber jar with excludes (which sbt supports just fine via assembly).
>>> >>>>
>>> >>>> Not really - as I mentioned initially in this thread, sbt's assembly does not take dependencies into account properly: it can overwrite newer classes with older versions.
>>> >>>> From an assembly point of view, sbt is not very good: we are yet to try it after the 2.10 shift, though (and probably won't, given the mess it created last time).
>>> >>>>
>>> >>>> Regards,
>>> >>>> Mridul
>>> >>>>
>>> >>>>>
>>> >>>>> I was wondering actually, do you know if it's possible to add shaded artifacts to the *spark jar* using this plug-in (i.e. not an uber jar)? That's something I could see being really handy in the future.
>>> >>>>>
>>> >>>>> - Patrick
>>> >>>>>
>>> >>>>>> On Tue, Feb 25, 2014 at 3:39 PM, Evan Chan <e...@ooyala.com> wrote:
>>> >>>>>> The problem is that the plugins are not equivalent. There is AFAIK no equivalent to the Maven shade plugin for sbt.
>>> >>>>>> There is an sbt plugin which can apparently read POM XML files (sbt-pom-reader). However, it can't possibly handle plugins, which is still problematic.
>>> >>>>>>
>>> >>>>>>> On Tue, Feb 25, 2014 at 3:31 PM, yao <yaosheng...@gmail.com> wrote:
>>> >>>>>>> I would prefer to keep both of them; it would be better even if that means pom.xml will be generated using sbt. Some companies, like my current one, have their own build infrastructures built on top of Maven. It is not easy to support sbt for these potential Spark clients. But I do agree to keep only one if there is a promising way to generate a correct configuration from the other.
>>> >>>>>>>
>>> >>>>>>> -Shengzhe
>>> >>>>>>>
>>> >>>>>>>> On Tue, Feb 25, 2014 at 3:20 PM, Evan Chan <e...@ooyala.com> wrote:
>>> >>>>>>>>
>>> >>>>>>>> The correct way to exclude dependencies in sbt is actually to declare a dependency as "provided". I'm not familiar with Maven or its dependencySet, but "provided" will mark the entire dependency tree as excluded. It is also possible to exclude jar by jar, but this is pretty error-prone and messy.
>>> >>>>>>>>
>>> >>>>>>>>> On Tue, Feb 25, 2014 at 2:45 PM, Koert Kuipers <ko...@tresata.com> wrote:
>>> >>>>>>>>> Yes, in sbt assembly you can exclude jars (although I never had a need for this) and files within jars.
>>> >>>>>>>>>
>>> >>>>>>>>> For example, I frequently remove log4j.properties, because for whatever reason Hadoop decided to include it, making it very difficult to use our own logging config.
>>> >>>>>>>>>
>>> >>>>>>>>>> On Tue, Feb 25, 2014 at 4:24 PM, Konstantin Boudnik <c...@apache.org> wrote:
>>> >>>>>>>>>>> On Fri, Feb 21, 2014 at 11:11 AM, Patrick Wendell wrote:
>>> >>>>>>>>>>> Kos - thanks for chiming in. Could you be more specific about what is available in Maven and not in sbt for these issues? I took a look at the Bigtop code relating to Spark.
>>> >>>>>>>>>>> As far as I could tell, [1] was the main point of integration with the build system (maybe there are other integration points)?
>>> >>>>>>>>>>>
>>> >>>>>>>>>>>> - In order to integrate Spark well into the existing Hadoop stack it was necessary to have a way to avoid transitive dependency duplications and possible conflicts.
>>> >>>>>>>>>>>>
>>> >>>>>>>>>>>>   E.g. the Maven assembly allows us to avoid adding _all_ Hadoop libs and later merely declare a Spark package dependency on standard Bigtop Hadoop packages. And yes - Bigtop packaging means the naming and layout would be standard across all commercial Hadoop distributions that are worth mentioning: ASF Bigtop convenience binary packages, and Cloudera or Hortonworks packages. Hence, the downstream user doesn't need to spend any effort to make sure that Spark "clicks in" properly.
>>> >>>>>>>>>>>
>>> >>>>>>>>>>> The sbt build also allows you to plug in a Hadoop version, similar to the Maven build.
>>> >>>>>>>>>>
>>> >>>>>>>>>> I am actually talking about the ability to exclude a set of dependencies from an assembly, similarly to what's happening in the dependencySet sections of assembly/src/main/assembly/assembly.xml.
>>> >>>>>>>>>> If there is comparable functionality in sbt, that would help quite a bit, apparently.
>>> >>>>>>>>>>
>>> >>>>>>>>>> Cos
>>> >>>>>>>>>>
>>> >>>>>>>>>>>> - Maven provides a relatively easy way to deal with the jar-hell problem, although the original Maven build was just shading everything into a huge lump of class files, oftentimes ending up with classes slamming on top of each other from different transitive dependencies.
>>> >>>>>>>>>>>
>>> >>>>>>>>>>> AFAIK we are only using the shade plug-in to deal with conflict resolution in the assembly jar. These are dealt with in sbt via the sbt assembly plug-in in an identical way. Is there a difference?
>>> >>>>>>>>>>
>>> >>>>>>>>>> I am bringing up the Shader because it is an awful hack which can't be used in a real controlled deployment.
>>> >>>>>>>>>>
>>> >>>>>>>>>> Cos
>>> >>>>>>>>>>
>>> >>>>>>>>>>> [1] https://git-wip-us.apache.org/repos/asf?p=bigtop.git;a=blob;f=bigtop-packages/src/common/spark/do-component-build;h=428540e0f6aa56cd7e78eb1c831aa7fe9496a08f;hb=master
>>> >>>>>>>>
>>> >>>>>>>> --
>>> >>>>>>>> --
>>> >>>>>>>> Evan Chan
>>> >>>>>>>> Staff Engineer
>>> >>>>>>>> e...@ooyala.com |
>>> >>>>>>
>>> >>>>>> --
>>> >>>>>> --
>>> >>>>>> Evan Chan
>>> >>>>>> Staff Engineer
>>> >>>>>> e...@ooyala.com |
>>> >
>>> > --
>>> > --
>>> > Evan Chan
>>> > Staff Engineer
>>> > e...@ooyala.com |
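
For concreteness on Evan's "one source of truth" proposal above: option (b) is just sbt's built-in make-pom task, which writes a pom.xml describing the project's declared dependencies. Option (a) might look roughly like the sketch below, assuming the sbt-pom-reader plugin; the package name, trait name, and file name are recalled from memory and are assumptions, not Spark's actual build.

  // project/SparkBuild.scala - hypothetical sketch, not Spark's real build file.
  // Assumes sbt-pom-reader is on the plugin classpath; names may differ by version.
  import com.typesafe.sbt.pom.PomBuild

  object SparkBuild extends PomBuild {
    // Module list, dependency coordinates, and versions are all derived from
    // pom.xml, so they are declared exactly once; sbt-only concerns (assembly
    // settings, custom tasks, etc.) would be layered on top here.
  }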
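
A rough sketch of the sbt-assembly merge control Evan mentions, and of Koert's log4j.properties removal, follows. This assumes the sbt-assembly plugin of that era (~0.9/0.10); key names, imports, and settings syntax vary across plugin and sbt versions, and the reference.conf case is an extra illustration, not something cited in the thread.

  // In the sbt build definition (sketch); assemblySettings comes from the plugin.
  import sbtassembly.Plugin._
  import AssemblyKeys._

  assemblySettings

  mergeStrategy in assembly <<= (mergeStrategy in assembly) { old =>
    {
      case "log4j.properties" => MergeStrategy.discard  // drop the copy Hadoop pulls in
      case "reference.conf"   => MergeStrategy.concat   // Akka-style configs must be concatenated, not overwritten
      case x                  => old(x)                 // fall back to the plugin's defaults
    }
  }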
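
And a minimal sketch of the two exclusion mechanisms Evan and Koert describe - marking a dependency tree as "provided" versus excluding transitive artifacts jar by jar. The coordinates and version numbers here are illustrative placeholders, not Spark's actual dependencies.

  // build.sbt (sketch; coordinates and versions are placeholders)
  libraryDependencies ++= Seq(
    // "provided": available on the compile classpath, but the whole hadoop-client
    // tree stays out of the assembled jar (the approach Evan recommends):
    "org.apache.hadoop" % "hadoop-client" % "2.2.0" % "provided",
    // Per-jar excludes are also possible, but error-prone and messy:
    "com.esotericsoftware.kryo" % "kryo" % "2.21" exclude("asm", "asm")
  )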