On Sat, Mar 1, 2014 at 2:05 AM, Patrick Wendell <pwend...@gmail.com> wrote:
> Hey,
>
> Thanks everyone for chiming in on this. I wanted to summarize these issues a bit, particularly wrt the constituents involved - does this seem accurate?
>
> = Spark Users =
> In general, those linking against Spark should be totally unaffected by the build choice. Spark will continue to publish well-formed poms and jars to Maven Central. This is a no-op wrt this decision.
>
> = Spark Developers =
> There are two concerns: (a) general day-to-day development and packaging, and (b) Spark binaries and packages for distribution.
>
> For (a) - sbt seems better because it's just nicer for doing Scala development (incremental compilation is simple, we have some home-baked tools for compiling Spark vs. the Spark deps, etc.). The argument that Maven has more "general know-how" hasn't, at least so far, affected us in the ~2 years we've maintained both builds - adding stuff for Maven is typically just as annoying/difficult as with sbt.
>
> For (b) - Some non-specific concerns were raised about bugs with the sbt assembly package - we should look into this and see what is going on. Maven has better out-of-the-box support for publishing to Maven Central; we'd have to do some manual work on our end to make this work well with sbt.
These are not non-specific concerns - assembly via sbt is fragile; the (manual) exclusion rules in the sbt project are testament to this. In particular, I don't see any quantifiable benefits in using sbt over Maven. Incremental compilation, compiling only a subproject, running specific tests, etc. are all available even with Maven - so they are not differentiators. On the other hand, sbt does introduce further manual overhead in dependency management for assembled/shaded jar creation.

Regards,
Mridul

>
> = Downstream Integrators =
> On this one it seems that Maven is the universal favorite, largely because of community awareness of Maven and comfort with Maven builds. Some things, like restructuring the Spark build to inherit config values from a vendor build, will not be possible with sbt (though fairly straightforward to work around). Other cases where vendors have directly modified or inherited the Spark build won't work anymore if we standardize on sbt. These have no obvious workaround at this point, as far as I can see.
>
> - Patrick
>
> On Wed, Feb 26, 2014 at 7:09 PM, Mridul Muralidharan <mri...@gmail.com> wrote:
>> On Feb 26, 2014 11:12 PM, "Patrick Wendell" <pwend...@gmail.com> wrote:
>>>
>>> @mridul - As far as I know, both Maven and sbt use fairly similar processes for building the assembly/uber jar. We actually used to package Spark with sbt, there were no specific issues we encountered, and AFAIK sbt respects versioning of transitive dependencies correctly. Do you have a specific bug listing for sbt that indicates something is broken?
>>
>> Slightly longish ...
>>
>> The assembled jar generated via sbt broke all over the place while I was adding YARN support in 0.6, and I had to fix the sbt project a fair bit to get it to work: we need the assembled jar to submit a YARN job.
>>
>> When I finally submitted those changes to 0.7, it broke even more - since dependencies changed: someone else had thankfully already added Maven support by then, which worked remarkably well out of the box (with some minor tweaks)!
>>
>> In theory, they might be expected to work the same, but practically they did not: as I mentioned, it must just have been luck that Maven worked that well; but given multiple past nasty experiences with sbt, and the fact that it does not bring anything compelling or new in contrast, I am fairly against the idea of using only sbt - in spite of Maven being unintuitive at times.
>>
>> Regards,
>> Mridul
>>
>>>
>>> @sandy - It sounds like you are saying that the CDH build would be easier with Maven because you can inherit the POM. However, is this just a matter of convenience for packagers, or would standardizing on sbt limit capabilities in some way? I assume that it would just mean a bit more manual work for packagers having to figure out how to set the Hadoop version in sbt and exclude certain dependencies. For instance, what does CDH do about other components like Impala that are not based on Maven at all?
>>>
>>> On Wed, Feb 26, 2014 at 9:31 AM, Evan Chan <e...@ooyala.com> wrote:
>>> > I'd like to propose the following way to move forward, based on the comments I've seen:
>>> >
>>> > 1. Aggressively clean up the giant dependency graph. One ticket I might work on if I have time is SPARK-681, which might remove the giant fastutil dependency (~15MB by itself).
>>> >
>>> > 2. Take an intermediate step by having only ONE source of truth w.r.t. dependencies and versions.
>>> > This means either:
>>> >   a) Using a Maven POM as the spec for dependencies, Hadoop version, etc., and then using sbt-pom-reader to import it.
>>> >   b) Using the build.scala as the spec, and "sbt make-pom" to generate the pom.xml for the dependencies.
>>> >
>>> > The idea is to remove the pain and errors associated with manual translation of dependency specs from one system to another, while still maintaining the things which are hard to translate (plugins).
>>> >
>>> > On Wed, Feb 26, 2014 at 7:17 AM, Koert Kuipers <ko...@tresata.com> wrote:
>>> >> We maintain an in-house Spark build using sbt. We have no problem using sbt assembly. We did add a few exclude statements for transitive dependencies.
>>> >>
>>> >> The main enemies of assemblies are jars that include stuff they shouldn't (kryo comes to mind, I think they include logback?), new versions of jars that change the provider/artifact without changing the package (asm), and incompatible new releases (protobuf). These break the transitive resolution process. I imagine that's true for any build tool.
>>> >>
>>> >> Besides shading, I don't see anything Maven can do that sbt cannot, and if I understand it correctly, shading is not currently done using the build tool.
>>> >>
>>> >> Since Spark is primarily Scala/Akka based, the main developer base will be familiar with sbt (I think?). Switching build tools is always painful. I personally think it is smarter to put this burden on a limited number of upstream integrators than on the community. That said, I don't think it's a problem for us to maintain an sbt build in-house if Spark switched to Maven.
>>> >>
>>> >> The problem is, the complete Spark dependency graph is fairly large, and there are a lot of conflicting versions in there - in particular when we bump versions of dependencies - making managing this messy at best.
>>> >>
>>> >> Now, I have not looked in detail at how Maven manages this - it might just be accidental that we get a decent out-of-the-box assembled shaded jar (since we don't do anything special to configure it).
>>> >> With the current state of sbt in Spark, it definitely is not a good solution: if we can enhance it (or it already is?), while keeping the management of the version/dependency graph manageable, I don't have any objections to using sbt or Maven!
>>> >> Too many excludes, pinned versions, etc. would just make things unmanageable in the future.
>>> >>
>>> >> Regards,
>>> >> Mridul
>>> >>
>>> >> On Wed, Feb 26, 2014 at 8:56 AM, Evan Chan <e...@ooyala.com> wrote:
>>> >>> Actually, you can control exactly how sbt assembly merges or resolves conflicts. I believe the default settings, however, lead to an ordering which cannot be controlled.
>>> >>>
>>> >>> I do wish for a smarter fat jar plugin.
>>> >>>
>>> >>> -Evan
>>> >>> To be free is not merely to cast off one's chains, but to live in a way that respects & enhances the freedom of others. (#NelsonMandela)
>>> >>>
>>> >>>> On Feb 25, 2014, at 6:50 PM, Mridul Muralidharan <mri...@gmail.com> wrote:
>>> >>>>
>>> >>>>> On Wed, Feb 26, 2014 at 5:31 AM, Patrick Wendell <pwend...@gmail.com> wrote:
>>> >>>>> Evan - this is a good thing to bring up.
>>> >>>>> Wrt the shader plug-in - right now we don't actually use it for bytecode shading; we simply use it for creating the uber jar with excludes (which sbt supports just fine via assembly).
>>> >>>>
>>> >>>> Not really - as I mentioned initially in this thread, sbt's assembly does not take dependencies into account properly: it can overwrite newer classes with older versions.
>>> >>>> From an assembly point of view, sbt is not very good: we are yet to try it after the 2.10 shift, though (and probably won't, given the mess it created last time).
>>> >>>>
>>> >>>> Regards,
>>> >>>> Mridul
>>> >>>>
>>> >>>>>
>>> >>>>> I was wondering actually, do you know if it's possible to add shaded artifacts to the *spark jar* using this plug-in (i.e. not an uber jar)? That's something I could see being really handy in the future.
>>> >>>>>
>>> >>>>> - Patrick
>>> >>>>>
>>> >>>>>> On Tue, Feb 25, 2014 at 3:39 PM, Evan Chan <e...@ooyala.com> wrote:
>>> >>>>>> The problem is that the plugins are not equivalent. There is AFAIK no equivalent to the Maven shade plugin for sbt.
>>> >>>>>> There is an sbt plugin which can apparently read POM XML files (sbt-pom-reader). However, it can't possibly handle plugins, which is still problematic.
>>> >>>>>>
>>> >>>>>>> On Tue, Feb 25, 2014 at 3:31 PM, yao <yaosheng...@gmail.com> wrote:
>>> >>>>>>> I would prefer to keep both of them; it would be better even if that means pom.xml will be generated using sbt. Some companies, like my current one, have their own build infrastructures built on top of Maven. It is not easy to support sbt for these potential Spark clients. But I do agree to keep only one if there is a promising way to generate a correct configuration from the other.
>>> >>>>>>>
>>> >>>>>>> -Shengzhe
>>> >>>>>>>
>>> >>>>>>>> On Tue, Feb 25, 2014 at 3:20 PM, Evan Chan <e...@ooyala.com> wrote:
>>> >>>>>>>>
>>> >>>>>>>> The correct way to exclude dependencies in sbt is actually to declare a dependency as "provided". I'm not familiar with Maven or its dependencySet, but "provided" will mark the entire dependency tree as excluded. It is also possible to exclude jar by jar, but this is pretty error-prone and messy.
>>> >>>>>>>>
>>> >>>>>>>>> On Tue, Feb 25, 2014 at 2:45 PM, Koert Kuipers <ko...@tresata.com> wrote:
>>> >>>>>>>>> Yes, in sbt assembly you can exclude jars (although I never had a need for this) and files within jars.
>>> >>>>>>>>>
>>> >>>>>>>>> For example, I frequently remove log4j.properties, because for whatever reason Hadoop decided to include it, making it very difficult to use our own logging config.
>>> >>>>>>>>>
>>> >>>>>>>>>> On Tue, Feb 25, 2014 at 4:24 PM, Konstantin Boudnik <c...@apache.org> wrote:
>>> >>>>>>>>>>> On Fri, Feb 21, 2014 at 11:11 AM, Patrick Wendell wrote:
>>> >>>>>>>>>>> Kos - thanks for chiming in. Could you be more specific about what is available in Maven and not in sbt for these issues? I took a look at the Bigtop code relating to Spark.
>>> >>>>>>>>>>> As far as I could tell, [1] was the main point of integration with the build system (maybe there are other integration points)?
>>> >>>>>>>>>>>
>>> >>>>>>>>>>>> - In order to integrate Spark well into the existing Hadoop stack it was necessary to have a way to avoid transitive dependency duplications and possible conflicts.
>>> >>>>>>>>>>>>
>>> >>>>>>>>>>>>   E.g. the Maven assembly allows us to avoid adding _all_ Hadoop libs and later merely declare a Spark package dependency on standard Bigtop Hadoop packages. And yes - Bigtop packaging means the naming and layout would be standard across all commercial Hadoop distributions that are worth mentioning: ASF Bigtop convenience binary packages, and Cloudera or Hortonworks packages. Hence, the downstream user doesn't need to spend any effort to make sure that Spark "clicks in" properly.
>>> >>>>>>>>>>>
>>> >>>>>>>>>>> The sbt build also allows you to plug in a Hadoop version, similar to the Maven build.
>>> >>>>>>>>>>
>>> >>>>>>>>>> I am actually talking about the ability to exclude a set of dependencies from an assembly, similarly to what's happening in the dependencySet sections of assembly/src/main/assembly/assembly.xml.
>>> >>>>>>>>>> If there is comparable functionality in sbt, that would help quite a bit, apparently.
>>> >>>>>>>>>>
>>> >>>>>>>>>> Cos
>>> >>>>>>>>>>
>>> >>>>>>>>>>>> - Maven provides a relatively easy way to deal with the jar-hell problem, although the original Maven build was just shading everything into a huge lump of class files, oftentimes ending up with classes slamming on top of each other from different transitive dependencies.
>>> >>>>>>>>>>>
>>> >>>>>>>>>>> AFAIK we are only using the shade plug-in to deal with conflict resolution in the assembly jar. These are dealt with in sbt via the sbt assembly plug-in in an identical way. Is there a difference?
>>> >>>>>>>>>>
>>> >>>>>>>>>> I am bringing up the Shader because it is an awful hack which can't be used in a real controlled deployment.
>>> >>>>>>>>>>
>>> >>>>>>>>>> Cos
>>> >>>>>>>>>>
>>> >>>>>>>>>>> [1] https://git-wip-us.apache.org/repos/asf?p=bigtop.git;a=blob;f=bigtop-packages/src/common/spark/do-component-build;h=428540e0f6aa56cd7e78eb1c831aa7fe9496a08f;hb=master
>>> >>>>>>>>
>>> >>>>>>>> --
>>> >>>>>>>> --
>>> >>>>>>>> Evan Chan
>>> >>>>>>>> Staff Engineer
>>> >>>>>>>> e...@ooyala.com |
>>> >>>>>>
>>> >>>>>> --
>>> >>>>>> --
>>> >>>>>> Evan Chan
>>> >>>>>> Staff Engineer
>>> >>>>>> e...@ooyala.com |
>>> >
>>> > --
>>> > --
>>> > Evan Chan
>>> > Staff Engineer
>>> > e...@ooyala.com |
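
For concreteness on Evan's "one source of truth" proposal above: option (b) is just sbt's built-in make-pom task, which writes a pom.xml describing the project's declared dependencies. Option (a) might look roughly like the sketch below, assuming the sbt-pom-reader plugin; the package name, trait name, and file name are recalled from memory and are assumptions, not Spark's actual build.

  // project/SparkBuild.scala - hypothetical sketch, not Spark's real build file.
  // Assumes sbt-pom-reader is on the plugin classpath; names may differ by version.
  import com.typesafe.sbt.pom.PomBuild

  object SparkBuild extends PomBuild {
    // Module list, dependency coordinates, and versions are all derived from
    // pom.xml, so they are declared exactly once; sbt-only concerns (assembly
    // settings, custom tasks, etc.) would be layered on top here.
  }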
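
A rough sketch of the sbt-assembly merge control Evan mentions, and of Koert's log4j.properties removal, follows. This assumes the sbt-assembly plugin of that era (~0.9/0.10); key names, imports, and settings syntax vary across plugin and sbt versions, and the reference.conf case is an extra illustration, not something cited in the thread.

  // In the sbt build definition (sketch); assemblySettings comes from the plugin.
  import sbtassembly.Plugin._
  import AssemblyKeys._

  assemblySettings

  mergeStrategy in assembly <<= (mergeStrategy in assembly) { old =>
    {
      case "log4j.properties" => MergeStrategy.discard  // drop the copy Hadoop pulls in
      case "reference.conf"   => MergeStrategy.concat   // Akka-style configs must be concatenated, not overwritten
      case x                  => old(x)                 // fall back to the plugin's defaults
    }
  }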
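
And a minimal sketch of the two exclusion mechanisms Evan and Koert describe - marking a dependency tree as "provided" versus excluding transitive artifacts jar by jar. The coordinates and version numbers here are illustrative placeholders, not Spark's actual dependencies.

  // build.sbt (sketch; coordinates and versions are placeholders)
  libraryDependencies ++= Seq(
    // "provided": available on the compile classpath, but the whole hadoop-client
    // tree stays out of the assembled jar (the approach Evan recommends):
    "org.apache.hadoop" % "hadoop-client" % "2.2.0" % "provided",
    // Per-jar excludes are also possible, but error-prone and messy:
    "com.esotericsoftware.kryo" % "kryo" % "2.21" exclude("asm", "asm")
  )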