On Sat, May 31, 2014 at 10:45 AM, Patrick Wendell <pwend...@gmail.com>
wrote:

> One other consideration popped into my head:
>
> 5. Shading our dependencies could mess up our external APIs if we
> ever return types that are outside of the spark package, because we'd
> then be returning shaded types that users have to deal with. E.g. where
> before we returned an o.a.flume.AvroFlumeEvent, we'd have to return a
> some.namespace.AvroFlumeEvent. Then users downstream would have to
> deal with converting our types into the correct namespace if they want
> to inter-operate with other libraries. We generally try to avoid ever
> returning types from other libraries, but it would be good to audit
> our APIs and see if we ever do this.


That's a good point.  It seems to me that if Spark is returning a type in
the public API, that type is part of the public API (for better or worse).
 So this is a case where you wouldn't want to shade that type.  But it
would be nice to avoid doing this, for exactly the reasons you state...
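
To make that concrete, here's a rough sketch of the conversion users would
end up writing if we shaded Flume but still returned its event type.  The
relocated package name below is made up and the Avro-generated accessors
are from memory, so treat it as illustrative only:

    // Before shading, the API hands back org.apache.flume.source.avro.AvroFlumeEvent.
    // After a relocation rule like org.apache.flume -> org.spark_project.flume,
    // it would hand back the relocated copy, and users would convert by hand:
    def unshade(e: org.spark_project.flume.source.avro.AvroFlumeEvent)
        : org.apache.flume.source.avro.AvroFlumeEvent = {
      val out = new org.apache.flume.source.avro.AvroFlumeEvent()
      out.setHeaders(e.getHeaders)  // identical in shape...
      out.setBody(e.getBody)        // ...but unrelated types as far as the JVM cares
      out
    }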

On Fri, May 30, 2014 at 10:54 PM, Patrick Wendell <pwend...@gmail.com>
> wrote:
> > Spark is a bit different than Hadoop MapReduce, so maybe that's a
> > source of some confusion. Spark is often used as a substrate for
> > building different types of analytics applications, so @DeveloperAPI
> > are internal APIs that we'd like to expose to application writers,
> > but that might be more volatile. This is like the internal APIs in
> > the Linux kernel: they aren't stable, but of course we try to minimize
> > changes to them. If people want to write lower-level modules against
> > them, that's fine with us, but they know the interfaces might change.
>

MapReduce is used as a substrate in a lot of cases, too.  Hive has
traditionally created MR jobs to do what it needs to do.  Similarly, Oozie
can create MR jobs.  It seems that @DeveloperApi is pretty similar to
@LimitedPrivate in Hadoop.  If I understand correctly, your hope is that
frameworks will use @DeveloperApi, but individual application developers
will steer clear.  That is a good plan, as long as you can ensure that the
framework developers are willing to lock their versions to a certain Spark
version.  Otherwise they will make the same arguments we've heard before:
that they don't want to transition off of a deprecated @DeveloperApi
because they want to keep support for Spark 1.0.0 (or whatever).  We hear
these arguments in Hadoop all the time...  now that Spark has a 1.0 release
they will carry more weight.  Remember, Hadoop APIs started nice and simple
too :)

>
> > This has worked pretty well over the years, even with many different
> > companies writing against those APIs.
> >
> > @Experimental are user-facing features we are trying out. Hopefully
> > that one is more clear.
> >
> > In terms of making a big jar that shades all of our dependencies - I'm
> > curious how that would actually work in practice. It would be good to
> > explore. There are a few potential challenges I see:
> >
> > 1. If any of our dependencies encode class name information in IPC
> > messages, this would break. E.g. can you definitely shade the Hadoop
> > client, protobuf, HBase client, etc. and have them send messages over
> > the wire? This could break things if class names are ever encoded in a
> > wire format.
>

Google protocol buffers assume a fixed schema.  That is to say, they do not
include metadata identifying the types of what is placed in them.  The
types are determined by convention.  It is possible to change the Java
package in which the protobuf classes reside with no harmful effects.  (See
HDFS-4909 for an example of this.)  The RPC itself does include a Java
class name for the interface we're talking to, though.  But the code for
handling this is all under our control, so if we had to make any minor
modifications to make shading work, we could.
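
A quick sketch of why that works (the message classes here are hypothetical
generated protos; the point is just that the wire bytes carry field numbers
and values, not Java package names):

    import com.example.proto.{Event => OriginalEvent}              // original package
    import org.spark_project.example.proto.{Event => ShadedEvent}  // relocated copy

    // Bytes written by the original class...
    val bytes: Array[Byte] =
      OriginalEvent.newBuilder().setId(42).setName("foo").build().toByteArray

    // ...parse fine in the relocated copy, because only the schema
    // (field numbers and types) matters, not the Java namespace.
    val roundTripped = ShadedEvent.parseFrom(bytes)
    assert(roundTripped.getId == 42 && roundTripped.getName == "foo")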

> 2. Many libraries like logging subsystems, configuration systems, etc
> > rely on static state and initialization. I'm not totally sure how e.g.
> > slf4j initializes itself if you have both a shaded and non-shaded copy
> > of slf4j present.
>

I guess the worst-case scenario would be that the shaded version of slf4j
creates a log file, but then the app's unshaded version overwrites that log
file.  I don't see how the two versions could "cooperate," since they aren't
sharing static state.  The only solutions I can see are leaving slf4j
unshaded, or setting up separate log files for spark-core versus the
application.  I haven't thought this through completely, but my gut feeling
is that if you're sharing a log file, you probably want to share the
logging code too.
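
In code, the two copies would look something like this (the shaded package
name is made up; the point is that each LoggerFactory initializes and binds
on its own):

    import org.slf4j.{LoggerFactory => AppLoggerFactory}                  // app's copy
    import org.spark_project.slf4j.{LoggerFactory => SparkLoggerFactory}  // shaded copy

    val appLog = AppLoggerFactory.getLogger("my.app")
    val sparkLog = SparkLoggerFactory.getLogger("org.apache.spark")

    // Each copy discovers its own binding and configuration; if both end up
    // pointed at the same file, nothing coordinates the writes.
    appLog.info("written through the application's slf4j binding")
    sparkLog.info("written through Spark's shaded copy of slf4j")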


> > 3. This would mean the spark-core jar would be really massive because
> > it would inline all of our deps. We've actually been thinking of
> > avoiding the current assembly jar approach because, due to scala
> > specialized classes, our assemblies now have more than 65,000 class
> > files in them leading to all kinds of bad issues. We'd have to stick
> > with a big uber assembly-like jar if we decide to shade stuff.
> > 4. I'm not totally sure how this would work when people want to e.g.
> > build Spark with different Hadoop versions. Would we publish different
> > shaded uber-jars for every Hadoop version? Would the Hadoop dep just
> > not be shaded... if so, what about all its dependencies?
>

I wonder if it would be possible to put Hadoop and its dependencies "in a
box" (as it were) by using a separate classloader for them.  That might
solve this without requiring an uber-jar.  It would be nice not to have to
transfer all that stuff each time you start a job... in a perfect world,
the stuff that had not changed would not need to be transferred (thinking
out loud here).
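
Very roughly, something like the following (paths and class names are purely
illustrative, and the real thing would want a thin bridge interface rather
than raw reflection):

    import java.io.File
    import java.net.{URL, URLClassLoader}

    val hadoopJars: Array[URL] = new File("/opt/hadoop/lib")
      .listFiles()
      .filter(_.getName.endsWith(".jar"))
      .map(_.toURI.toURL)

    // A null parent means only bootstrap classes are shared, so Hadoop's
    // dependencies stay inside this loader and never leak onto the
    // application's classpath.
    val hadoopLoader = new URLClassLoader(hadoopJars, null)

    val confClass = Class.forName("org.apache.hadoop.conf.Configuration",
      true, hadoopLoader)
    val conf = confClass.getDeclaredConstructor().newInstance()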

best,
Colin


>
> > Anyways just some things to consider... simplifying our classpath is
> > definitely an avenue worth exploring!
> >
> >
> >
> >
> > On Fri, May 30, 2014 at 2:56 PM, Colin McCabe <cmcc...@alumni.cmu.edu>
> wrote:
> >> On Fri, May 30, 2014 at 2:11 PM, Patrick Wendell <pwend...@gmail.com>
> wrote:
> >>
> >>> Hey guys, thanks for the insights. Also, I realize Hadoop has gotten
> >>> way better about this with 2.2+ and I think it's great progress.
> >>>
> >>> We have well defined API levels in Spark and also automated checking
> >>> of API violations for new pull requests. When doing code reviews we
> >>> always enforce the narrowest possible visibility:
> >>>
> >>> 1. private
> >>> 2. private[spark]
> >>> 3. @Experimental or @DeveloperApi
> >>> 4. public
> >>>
> >>> Our automated checks exclude 1-3. Anything that breaks 4 will trigger
> >>> a build failure.
> >>>
> >>>
> >> That's really excellent.  Great job.
> >>
> >> I like the private[spark] visibility level-- sounds like this is another
> >> way Scala has greatly improved on Java.
> >>
> >> The Scala compiler prevents anyone external from using 1 or 2. We do
> >>> have "bytecode public but annotated" (3) API's that we might change.
> >>> We spent a lot of time looking into whether these can offer compiler
> >>> warnings, but we haven't found a way to do this and do not see a
> >>> better alternative at this point.
> >>>
> >>
> >> It would be nice if the production build could strip this stuff out.
> >>  Otherwise, it feels a lot like a @private, @unstable Hadoop API... and
> we
> >> know how those turned out.
> >>
> >>
> >>> Regarding Scala compatibility, Scala 2.11+ is "source code
> >>> compatible", meaning we'll be able to cross-compile Spark for
> >>> different versions of Scala. We've already been in touch with Typesafe
> >>> about this and they've offered to integrate Spark into their
> >>> compatibility test suite. They've also committed to patching 2.11 with
> >>> a minor release if bugs are found.
> >>>
> >>
> >> Thanks, I hadn't heard about this plan.  Hopefully we can get everyone
> on
> >> 2.11 ASAP.
> >>
> >>
> >>> Anyways, my point is we've actually thought a lot about this already.
> >>>
> >>> The CLASSPATH thing is different than API stability, but indeed also a
> >>> form of compatibility. This is something where I'd also like to see
> >>> Spark have better isolation of user classes from Spark's own
> >>> execution...
> >>>
> >>>
> >> I think the best thing to do is just "shade" all the dependencies.  Then
> >> they will be in a different namespace, and clients can have their own
> >> versions of whatever dependencies they like without conflicting.  As
> >> Marcelo mentioned, there might be a few edge cases where this breaks
> >> reflection, but I don't think that's an issue for most libraries.  So in
> >> the worst case we could end up needing apps to follow us in lockstep for
> Kryo
> >> or maybe Akka, but not the whole kit and caboodle like with Hadoop.
> >>
> >> best,
> >> Colin
> >>
> >>
> >> - Patrick
> >>>
> >>>
> >>>
> >>> On Fri, May 30, 2014 at 12:30 PM, Marcelo Vanzin <van...@cloudera.com>
> >>> wrote:
> >>> > On Fri, May 30, 2014 at 12:05 PM, Colin McCabe <
> cmcc...@alumni.cmu.edu>
> >>> wrote:
> >>> >> I don't know if Scala provides any mechanisms to do this beyond what
> >>> Java provides.
> >>> >
> >>> > In fact it does. You can say something like "private[foo]" and the
> >>> > annotated element will be visible for all classes under "foo" (where
> >>> > "foo" is any package in the hierarchy leading up to the class).
> That's
> >>> > used a lot in Spark.
> >>> >
> >>> > I haven't fully looked at how the @DeveloperApi is used, but I agree
> >>> > with you - annotations are not a good way to do this. The Scala
> >>> > feature above would be much better, but it might still leak things at
> >>> > the Java bytecode level (don't know how Scala implements it under the
> >>> > cover, but I assume it's not by declaring the element as a Java
> >>> > "private").
> >>> >
> >>> > Another thing is that in Scala the default visibility is public,
> which
> >>> > makes it very easy to inadvertently add things to the API. I'd like
> to
> >>> > see more care in making things have the proper visibility - I
> >>> > generally declare things private first, and relax that as needed.
> >>> > Using @VisibleForTesting would be great too, when the Scala
> >>> > private[foo] approach doesn't work.
> >>> >
> >>> >> Does Spark also expose its CLASSPATH in
> >>> >> this way to executors?  I was under the impression that it did.
> >>> >
> >>> > If you're using the Spark assemblies, yes, there are a lot of things
> >>> > that your app gets exposed to. For example, you can see Guava and
> >>> > Jetty (and many other things) there. This is something that has
> always
> >>> > bugged me, but I don't really have a good suggestion of how to fix
> it;
> >>> > shading goes a certain way, but it also breaks code that uses
> >>> > reflection (e.g. Class.forName()-style class loading).
> >>> >
> >>> > What is worse is that Spark doesn't even agree with the Hadoop code
> it
> >>> > depends on; e.g., Spark uses Guava 14.x while Hadoop is still in
> Guava
> >>> > 11.x. So when you run your Scala app, what gets loaded?
> >>> >
> >>> >> At some point we will also have to confront the Scala version issue.
> >>>  Will
> >>> >> there be flag days where Spark jobs need to be upgraded to a new,
> >>> >> incompatible version of Scala to run on the latest Spark?
> >>> >
> >>> > Yes, this could be an issue - I'm not sure Scala has a policy towards
> >>> > this, but updates (at least minor, e.g. 2.9 -> 2.10) tend to break
> >>> > binary compatibility.
> >>> >
> >>> > Scala also makes some API updates tricky - e.g., adding a new named
> >>> > argument to a Scala method is not a binary compatible change (while,
> >>> > e.g., adding a new keyword argument in a python method is just fine).
> >>> > The use of implicits and other Scala features make this even more
> >>> > opaque...
> >>> >
> >>> > Anyway, not really any solutions in this message, just a few comments
> >>> > I wanted to throw out there. :-)
> >>> >
> >>> > --
> >>> > Marcelo
> >>>
>
