Re: Spark Improvement Proposals

Matei Zaharia Thu, 06 Oct 2016 21:14:40 -0700

Hey Cody,

Thanks for bringing these up. You're raising quite a few different things 
here, so let me take each in turn.

1) About technical / design discussion -- I fully agree that everything big 
should go through a lot of review, and I like the idea of a more formal way to 
propose and comment on larger features. So far, all of this has been done 
through JIRA, but as a start, maybe marking JIRAs as large (we often use 
Umbrella for this) and also opening a thread on the list about each such JIRA 
would help. For Structured Streaming in particular, FWIW, a pretty complete 
doc on the proposed semantics has been up at 
https://issues.apache.org/jira/browse/SPARK-8360 since March. But it's true 
that other things, such as the Kafka source for it, didn't have as much design 
discussion on JIRA. Nonetheless, this component is still early on and there's 
still plenty of time to change it, and changes are already happening.

2) About what people say at Reactive Summit -- there will always be trolls, but 
just ignore them and build a great project. Those of us involved in the project 
for a while have long seen similar stuff, e.g. a prominent company saying Spark 
doesn't scale past 100 nodes when there were many documented instances to the 
contrary, and the best answer is to just make the project better. This same 
company, if you read their website now, recommends Apache Spark for almost 
anything. For streaming in particular, there is a lot of confusion because many 
of the concepts aren't well defined (e.g. what "at least once" means), and 
it's also a crowded space. But Spark Streaming prioritizes a few things that it 
does very well: correctness (you can easily tell what the app will do, and it 
does the same thing despite failures), ease of programming (which also requires 
correctness), and scalability. We should of course both explain what it does in 
more places and work on improving it where needed (e.g. adding a higher level 
API with Structured Streaming and built-in primitives for external timestamps).

3) About number and diversity of committers -- the PMC is always working to 
expand these, and you should email people on the PMC (or even the whole list) 
if you have people you'd like to propose. In general I think nearly all 
committers added in the past year were from organizations that haven't long 
been involved in Spark, and the number of committers continues to grow pretty 
fast.

4) Finally, about better organizing JIRA, marking dead issues, etc. -- this 
would be great, and I think we just need a concrete proposal for how to do it. 
BTW, it would be best to point to an existing process that someone else has 
used, so that we can see it in action.

Matei

> On Oct 6, 2016, at 7:51 PM, Cody Koeninger <c...@koeninger.org> wrote:
> 
> I love Spark.  3 or 4 years ago it was the first distributed computing
> environment that felt usable, and the community was welcoming.
> 
> But I just got back from the Reactive Summit, and this is what I observed:
> 
> - Industry leaders on stage making fun of Spark's streaming model
> - Open source project leaders saying they looked at Spark's governance
> as a model to avoid
> - Users saying they chose Flink because it was technically superior
> and they couldn't get any answers on the Spark mailing lists
> 
> Whether you agree with the substance of any of this, when this stuff
> gets repeated enough people will believe it.
> 
> Right now Spark is suffering from its own success, and I think
> something needs to change.
> 
> - We need a clear process for planning significant changes to the codebase.
> I'm not saying you need to adopt Kafka Improvement Proposals exactly,
> but you need a documented process with a clear outcome (e.g. a vote).
> Passing around Google Docs after an implementation has largely been
> decided on doesn't cut it.
> 
> - All technical communication needs to be public.
> Things getting decided in private chat, or when 1/3 of the committers
> work for the same company and can just talk to each other...
> Yes, it's convenient, but it's ultimately detrimental to the health of
> the project.
> The way structured streaming has played out has shown that there are
> significant technical blind spots (myself included).
> One way to address that is to get the people who have domain knowledge
> involved, and listen to them.
> 
> - We need more committers, and more committer diversity.
> Per committer there are, what, more than 20 contributors and 10 new
> JIRA tickets a month?  It's too much.
> There are people (I am _not_ referring to myself) who have been around
> for years, contributed thousands of lines of code, helped educate the
> public around Spark... and yet are never going to be voted in.
> 
> - We need a clear process for managing volunteer work.
> Too many tickets sit around unowned, unclosed, uncertain.
> If someone proposed something and it isn't up to snuff, tell them and
> close it.  It may be blunt, but it's clearer than "silent no".
> If someone wants to work on something, let them own the ticket and set
> a deadline. If they don't meet it, close it or reassign it.
> 
> This is not me putting on an Apache Bureaucracy hat.  This is me
> saying, as a fellow hacker and loyal dissenter, something is wrong
> with the culture and process.
> 
> Please, let's change it.
> 
> ---------------------------------------------------------------------
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
> 
