Yes, that's clearer -- at least to me.

But before going any further, let me note that we are already sliding past
Sean's opening question of "Should we start talking about Spark 2.0?" to
actually start talking about Spark 2.0.  I'll try to keep the rest of this
post at a higher, meta level to avoid a somewhat premature discussion of
detailed 2.0 proposals, since I think that we do
still need to answer Sean's question and a couple of related questions
before really diving into the details of 2.0 planning.  The related
questions that I am talking about are: Is Spark 1.x done except for
bug-fixing? and What would definitely make us say that we must begin
working on Spark 2.0?

I'm not going to try to answer my own two questions even though I'm really
interested in how others will answer them, but I will answer Sean's by
saying that it is a good time to start talking about Spark 2.0 -- which is
quite different from saying that we are close to an understanding of what
will differentiate Spark 2.0 or when we want to deliver it.

On the meta-2.0 discussion, I think that it is useful to break "Things that
will be different in 2.0" into some distinct categories.  I see at least
three such categories for openers, although the third will probably need to
be broken down further.

The first is the simplest, would take almost no time to complete, and would
have minimal impact on current Spark users.  This is simply getting rid of
everything that is already marked deprecated in Spark 1.x but that we
haven't already gotten rid of because of our commitment to maintaining API
stability within major versions.  There should be no need for discussion or
apology before getting rid of what is already deprecated -- it's just gone
and it's time to move on.  A kind of category 1.1 is the parts of the
current public API that are now marked as Experimental or DeveloperApi and
that should become part of the fully-supported public API in 2.0 -- and
there is room for debate here.
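
To make the two sub-categories concrete, here is a minimal, purely
illustrative Scala sketch -- the class and method names are made up; only
the annotations (scala.deprecated plus org.apache.spark.annotation's
Experimental/DeveloperApi) are real:

import org.apache.spark.annotation.{DeveloperApi, Experimental}

class ExampleContext {

  // Category 1: already @deprecated in 1.x, so in 2.0 it is simply deleted.
  @deprecated("use newCount() instead", "1.4.0")
  def oldCount(): Long = newCount()

  def newCount(): Long = 0L

  // Category 1.1: currently annotated as Experimental/DeveloperApi; the
  // debate is whether 2.0 drops the annotation and makes these part of the
  // fully-supported public API.
  @Experimental
  def experimentalStats(): Map[String, Long] = Map.empty

  @DeveloperApi
  def developerHook(): Unit = ()
}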

The next category of things that will be different in 2.0 isn't a lot
harder to implement, shouldn't take a lot of time to complete, but will
have some impact on current Spark users.  I'm talking about areas in the
current code that we know don't work the way we want them to and don't have
the public API that we would like, but for which there aren't or can't be
recommended alternatives yet, so the code isn't formally marked as
deprecated.  Again, these are things that we haven't already changed mostly
because of the need to maintain API stability in 1.x.  But because these
haven't already been marked as deprecated, there is potential to catch
existing Spark users by surprise when the API changes.  We don't guarantee
API stability across major version number changes, so there isn't any
reason why we can't make the changes we want, but we should start building
up a comprehensive list of API changes that will occur in Spark 2.0 to at
least minimize the amount of surprise for current Spark users.

I don't already have anything like such a comprehensive list, but one
example of the kind of thing that I am talking about is something that I've
personally been looking at and regretting of late, and that's the
complicated relationships among SparkListener, SQLListener, onJobEnd and
onExecutionEnd.  A lot of this complication is because of the need to
maintain the public API, so we end up with comments like this (
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/SQLExecution.scala#L58):
"Ideally, we need to make sure onExecutionEnd happens after onJobStart and
onJobEnd.  However, onJobStart and onJobEnd run in the listener thread.
Because we cannot add new SQL event types to SparkListener since it's a
public API, we cannot guarantee that."  I think it should be pretty obvious
that we should be making these kinds of architectural and API changes in
2.0 -- they are currently causing Spark developers and often Spark users to
deal with complications that are needed mostly or entirely just to maintain
the public API.  I know that there are other (and larger) examples of this
kind of refactoring that others are itching to start doing, but I'll let
them speak to those specifics as we build up the list of API changes.
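
Just to make the shape of that example concrete, here is a rough sketch of
what first-class SQL execution events on the listener bus could look like
in 2.0.  The event and callback names below are hypothetical, not the
actual Spark API, and the real fix may well look quite different:

import org.apache.spark.scheduler.{SparkListener, SparkListenerJobEnd, SparkListenerJobStart}

// A first-class SQL execution event, so the listener bus could order it
// relative to onJobStart/onJobEnd instead of relying on a side channel.
case class SQLExecutionEnd(executionId: Long, time: Long)

// A 2.0-style listener trait that exposes the new callback directly.
trait SparkListener2 extends SparkListener {
  def onSQLExecutionEnd(event: SQLExecutionEnd): Unit = {}
}

class LoggingListener extends SparkListener2 {
  override def onJobStart(jobStart: SparkListenerJobStart): Unit =
    println(s"job ${jobStart.jobId} started")

  override def onJobEnd(jobEnd: SparkListenerJobEnd): Unit =
    println(s"job ${jobEnd.jobId} ended")

  override def onSQLExecutionEnd(event: SQLExecutionEnd): Unit =
    println(s"SQL execution ${event.executionId} ended at ${event.time}")
}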

The third category of things that will be different in 2.0 is the category
that could get us seriously off track or badly behind schedule.  This is
the category that includes fundamental architectural changes or significant
new features and functionality.  Before we go too far afield in exploring
our wish lists for Spark 2.0, I think we need to try very hard to identify
which architectural changes are really needed to achieve a minimum viable
platform that can meet our goals for the complete Spark 2.0 cycle.  This
starts to get back to my questions of whether Spark 1.x is done and whether
we really need to start working on Spark 2.0.  If we look back at the total
Spark ecosystem in the 0.9 timeframe vs. where it is now on the verge of
1.6.0, it should be clear that an amazing number of additions and
refinements have been made to Spark itself, Spark packages, third-party
tools and applications, etc. -- and all of that was done without requiring
fundamental changes to Spark's architecture.  What I think that implies is
that as items are added to our collective wish list for Spark 2.0, we need
to be asking of each one at least two things: 1) Whether it really requires
a fundamental change in Spark before this new feature or functionality can
be implemented; and 2) If it does require a fundamental change, is that
change (but not necessarily all the new features that need that change)
something that we are willing to commit to completing before Spark 2.0.0
can be released?  Or alternatively, is that a fundamental change that we
can and should put off making for potentially years while the Spark 2.x
development cycle runs its course?  If wish list items don't require
fundamental changes, then we shouldn't feel bad about needing to say for
many of them that they look like good and/or interesting ideas, and things
that we may very well want to include in Spark, but that they may end up in
Spark 2.x instead of 2.0.

To finally get back to your posts, Romi, what I think you are talking about
is the ability to compose things like Spark Jobs, RDD actions and SQL
Executions without needing to explicitly coordinate the collection of
intermediate results to the Driver and the redistribution of data to the
Executors.  This is the kind of thing that is already done in some respects
in transformations like RDD#sortByKey, but that actually breaks Spark's
claim that transformations are lazy.  Wanting to be able to compose things
in Spark in a manner more in line with what functional programmers expect
and doing so without breaking other expectations of Spark users is
something that has been on several others' wish lists for a while now.  A
few attempts have been made to address the issue within the Spark 1.x
architecture, and some of the recent additions that Matei has made in
regard to realizing adaptive DAG scheduling may allow us to push things
further within Spark 1.x, but this may also be the kind of thing that will
prompt us to make deeper changes in Spark 2.0.
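
As a small illustration of the laziness caveat above (a minimal sketch
against the 1.x API; exact behavior depends on the version): sortByKey
builds a RangePartitioner, which samples the keys, so a job can show up in
the UI before any action has been called.

import org.apache.spark.{SparkConf, SparkContext}

object LazinessDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("laziness-demo").setMaster("local[*]"))

    val pairs = sc.parallelize(Seq(3 -> "c", 1 -> "a", 2 -> "b"))

    // A plain transformation: nothing is scheduled yet.
    val mapped = pairs.mapValues(_.toUpperCase)

    // Also "just" a transformation, but range partitioning needs to sample
    // the keys, so a sampling job may already run here.
    val sorted = mapped.sortByKey()

    // The action we would normally expect to be the first job.
    sorted.collect().foreach(println)

    sc.stop()
  }
}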

Where I thought you were going at first is another category-three item:
whether Spark should be fundamentally changed to allow streams to be handled
at the event level instead of (or in addition to) micro-batches.

So, from my perspective, that is a meta-framework that I think is useful to
shape the Spark 2.0 discussion, a couple of category-three wish-list items,
and a bunch of questions that I'm not even going to try to answer on my own
-- but I'm looking forward to the Spark 2.0 discussion.

On Sun, Nov 8, 2015 at 8:14 AM, Romi Kuntsman <r...@totango.com> wrote:

> Hi, thanks for the feedback
> I'll try to explain better what I meant.
>
> First we had RDDs, then we had DataFrames, so could the next step be
> something like stored procedures over DataFrames?
> So I define the whole calculation flow, even if it includes any "actions"
> in between, and the whole thing is planned and executed in a super
> optimized way once I tell it "go!"
>
> What I mean by "feels like scripted" is that actions come back to the
> driver, like they would if you were in front of a command prompt.
> But often the flow contains many steps with actions in between - multiple
> levels of aggregations, iterative machine learning algorithms etc.
> Sending the whole "workplan" to the Spark framework would be, as I see it,
> the next step of its evolution, like stored procedures send logic with
> many SQL queries to the database.
>
> Was it more clear this time? :)
>
>
> *Romi Kuntsman*, *Big Data Engineer*
> http://www.totango.com
>
> On Sun, Nov 8, 2015 at 5:59 PM, Koert Kuipers <ko...@tresata.com> wrote:
>
>> romi,
>> unless am i misunderstanding your suggestion you might be interested in
>> projects like the new mahout where they try to abstract out the engine with
>> bindings, so that they can support multiple engines within a single
>> platform. I guess cascading is heading in a similar direction (although no
>> spark or flink yet there, just mr1 and tez).
>>
>> On Sun, Nov 8, 2015 at 6:33 AM, Sean Owen <so...@cloudera.com> wrote:
>>
>>> Major releases can change APIs, yes. Although Flink is pretty similar
>>> in broad design and goals, the APIs are quite different in
>>> particulars. Speaking for myself, I can't imagine merging them, as it
>>> would either mean significantly changing Spark APIs, or making Flink
>>> use Spark APIs. It would mean effectively removing one project which
>>> seems infeasible.
>>>
>>> I am not sure of what you're saying the difference is, but I would not
>>> describe Spark as primarily for interactive use.
>>>
>>> Philosophically, I don't think One Big System to Rule Them All is a
>>> good goal. One project will never get it all right even within one
>>> niche. It's actually valuable to have many takes on important
>>> problems. Hence any problem worth solving gets solved 10 times. Just
>>> look at all those SQL engines and logging frameworks...
>>>
>>> On Sun, Nov 8, 2015 at 10:53 AM, Romi Kuntsman <r...@totango.com> wrote:
>>> > A major release usually means giving up on some API backward
>>> compatibility?
>>> > Can this be used as a chance to merge efforts with Apache Flink
>>> > (https://flink.apache.org/) and create the one ultimate open source
>>> big data
>>> > processing system?
>>> > Spark currently feels like it was made for interactive use (like
>>> Python and
>>> > R), and when used others (batch/streaming), it feels like scripted
>>> > interactive instead of really a standalone complete app. Maybe some
>>> base
>>> > concepts may be adapted?
>>> >
>>> > (I'm not currently a committer, but as a heavy Spark user I'd love to
>>> > participate in the discussion of what can/should be in Spark 2.0)
>>> >
>>> > Romi Kuntsman, Big Data Engineer
>>> > http://www.totango.com
>>> >
>>> > On Fri, Nov 6, 2015 at 2:53 PM, Jean-Baptiste Onofré <j...@nanthrax.net>
>>> > wrote:
>>> >>
>>> >> Hi Sean,
>>> >>
>>> >> Happy to see this discussion.
>>> >>
>>> >> I'm working on PoC to run Camel on Spark Streaming. The purpose is to
>>> have
>>> >> an ingestion and integration platform directly running on Spark
>>> Streaming.
>>> >>
>>> >> Basically, we would be able to use a Camel Spark DSL like:
>>> >>
>>> >>
>>> >>
>>> from("jms:queue:foo").choice().when(predicate).to("job:bar").when(predicate).to("hdfs:path").otherwise("file:path")....
>>> >>
>>> >> Before a formal proposal (I have to do more work there), I'm just
>>> >> wondering if such framework can be a new Spark module (Spark
>>> Integration for
>>> >> instance, like Spark ML, Spark Stream, etc).
>>> >>
>>> >> Maybe it could be a good candidate for an addition in a "major"
>>> release
>>> >> like Spark 2.0.
>>> >>
>>> >> Just my $0.01 ;)
>>> >>
>>> >> Regards
>>> >> JB
>>> >>
>>> >>
>>> >> On 11/06/2015 01:44 PM, Sean Owen wrote:
>>> >>>
>>> >>> Since branch-1.6 is cut, I was going to make version 1.7.0 in JIRA.
>>> >>> However I've had a few side conversations recently about Spark 2.0,
>>> and
>>> >>> I know I and others have a number of ideas about it already.
>>> >>>
>>> >>> I'll go ahead and make 1.7.0, but thought I'd ask, how much other
>>> >>> interest is there in starting to plan Spark 2.0? is that even on the
>>> >>> table as the next release after 1.6?
>>> >>>
>>> >>> Sean
>>> >>
>>> >>
>>> >> --
>>> >> Jean-Baptiste Onofré
>>> >> jbono...@apache.org
>>> >> http://blog.nanthrax.net
>>> >> Talend - http://www.talend.com
>>> >>
>>> >>
>>> >
>>>
>>>
>>>
>>
>
