Some thoughts from the peanut gallery...

On Thu, Jul 9, 2015 at 5:14 PM, Martin Kleppmann <mar...@kleppmann.com> wrote:
> Thanks Julian for calling out the principle of community over code, which is 
> super important. If it was just a matter of code, the Kafka project could 
> simply pull in the Samza code (or write a new stream processor) without 
> asking permission -- but they wouldn't get the Samza community. Thus, I think 
> the community aspect is the most important part of this discussion. If we're 
> talking about merging projects, it's really about merging communities.
>
> I had a chat with a friend who is a Lucene/Solr committer: those were also 
> originally two separate projects, which merged into one. He said the merge 
> was not always easy, but probably a net win for both projects and communities 
> overall. In their community people tend to specialise on either the Lucene 
> part or the Solr part, but that's ok -- it's still a cohesive community 
> nevertheless, and it benefits from close collaboration due to having everyone 
> in the same project. Releases didn't slow down; in fact, they perhaps got 
> faster due to less cross-project coordination overhead. So that allayed my 
> concerns about a big project becoming slow.

It seems to me that looking at the Lucene/Solr merge is only helpful
if you're experiencing the same pain points.  In that case,
enhancements would occur "downstream" (i.e. in Solr) that either
wouldn't make it upstream to Lucene core at all, or would take a long
time to do so.  I haven't lurked here long enough to know whether
that's the case here.  In any case, I reckon it'd be best to consider
these two things (the community's future and the code's future)
independently.  Also, around the same time, Tika and Mahout went in
the opposite direction (to top-level projects) and have flourished...
I'd also just say that a "subproject" is an anti-pattern around here...

> Besides community and code/architecture, another consideration is our user 
> base (including those who are not on this mailing list). What is good for our 
> users? I've thought about this more over the last few days:

I'm a new, dumb user, so I'm happy to help think through what's good for me :)

> - Reducing users' confusion is good. If someone is adopting Kafka, they will 
> also need some way of processing their data in Kafka. At the moment, the 
> Kafka docs give you consumer APIs but nothing more. Having to choose a 
> separate stream processing framework is a burden on users, especially if that 
> framework uses terminology that is inconsistent with Kafka. If we make Samza 
> a part of Kafka and unify the terminology, it would become a coherent part of 
> the documentation, and be much less confusing for users.

I don't think "having to choose..." is a burden - I simply didn't know
Samza existed until a friend pointed me to it, and that could be fixed
by convincing the Kafka site to link to Samza more prominently.  So
far, the terminology hasn't confused me, but maybe that's because my
usage so far is really unsophisticated.
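
That said, to be fair to Martin's point, "consumer APIs but nothing
more" does mean starting from a bare poll loop like the sketch below
(I'm using the newer KafkaConsumer API here; the broker address, group
id, and topic name are made up), where repartitioning, state, and
restarts are all left to you:

import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class BareConsumerLoop {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // made-up broker
        props.put("group.id", "page-view-counter");         // made-up group id
        props.put("key.deserializer",
                  "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                  "org.apache.kafka.common.serialization.StringDeserializer");

        KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
        consumer.subscribe(Collections.singletonList("page-views")); // made-up topic

        // Anything a framework would normally handle -- keying, local state,
        // producing derived streams, recovering after a crash -- has to be
        // built by hand on top of this loop.
        while (true) {
            ConsumerRecords<String, String> records = consumer.poll(100);
            for (ConsumerRecord<String, String> record : records) {
                System.out.printf("offset=%d key=%s value=%s%n",
                                  record.offset(), record.key(), record.value());
            }
        }
    }
}

For my simple use case that loop is plenty, which is probably why the
choice hasn't felt like a burden yet.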

> - Making it easy for users to get started is good. Simplifying the API and 
> configuration is part of it. Making YARN optional is also good. It would also 
> help to be part of the same package that people download, and part of the 
> same documentation. (Simplifying API/config and decoupling from YARN can be 
> done as a separate project; becoming part of the same package would require 
> merging projects.)

FWIW, YARN was one of the compelling aspects of Samza, because I didn't
have to wonder about resilience the way I do with the Flume-Kafka
setups.  Plus, your grid/bootstrap stuff makes it really sweet to get
started.  Of course, YARN has the downside of being extremely difficult
to debug, and that has been really annoying, but other than debug
logging to a kafka topic or something, I'm not sure how y'all can
improve that.
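
To be concrete about the "debug logging to a kafka topic" idea, what I
had in mind is roughly a little shipper like the sketch below that
pushes container log lines through a plain Kafka producer (the class
and topic name are made up for illustration, not anything Samza or
YARN provide today):

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class LogShipper {
    private final KafkaProducer<String, String> producer;

    public LogShipper(String brokers) {
        Properties props = new Properties();
        props.put("bootstrap.servers", brokers);
        props.put("key.serializer",
                  "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                  "org.apache.kafka.common.serialization.StringSerializer");
        producer = new KafkaProducer<>(props);
    }

    // Ship one log line, keyed by container id so all lines from a given
    // container land in the same partition and stay in order.
    public void ship(String containerId, String logLine) {
        producer.send(new ProducerRecord<>("job-debug-logs", containerId, logLine));
    }

    public void close() {
        producer.close();
    }
}

Then kafka-console-consumer on that topic gives you one place to tail,
instead of digging through YARN container directories.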

Anyway, thanks for Samza, it's been really nice so far...

Thanks,
--tim
