Re: Spark vs Google cloud dataflow

2014-06-27 Thread Khanderao Kand
Dataflow is based on two papers: MillWheel for stream processing, and FlumeJava for programming optimization and abstraction. MillWheel: http://research.google.com/pubs/pub41378.html FlumeJava: http://dl.acm.org/citation.cfm?id=1806638 Here is my blog entry on this: http://texploration.wordpress.com/

Re: Spark vs Google cloud dataflow

2014-06-27 Thread Marco Shaw
Sorry. Never mind... I guess that's what "Summingbird" is all about. Never heard of it. > On Jun 27, 2014, at 7:10 PM, Marco Shaw wrote: > > Dean: Some interesting information... Do you know where I can read more about > these coming changes to Scalding/Cascading? > >> On Jun 27, 2014, at 9

Re: Spark vs Google cloud dataflow

2014-06-27 Thread Marco Shaw
Dean: Some interesting information... Do you know where I can read more about these coming changes to Scalding/Cascading? > On Jun 27, 2014, at 9:40 AM, Dean Wampler wrote: > > ... and to be clear on the point, Summingbird is not limited to MapReduce. It > abstracts over Scalding (which abstra

Re: Spark vs Google cloud dataflow

2014-06-27 Thread Dean Wampler
... and to be clear on the point, Summingbird is not limited to MapReduce. It abstracts over Scalding (which abstracts over Cascading, which is being moved from MR to Spark) and over Storm for event processing. On Fri, Jun 27, 2014 at 7:16 AM, Sean Owen wrote: > On Thu, Jun 26, 2014 at 9:15 AM,

Re: Spark vs Google cloud dataflow

2014-06-27 Thread Sean Owen
On Thu, Jun 26, 2014 at 9:15 AM, Aureliano Buendia wrote: > Summingbird is for map/reduce. Dataflow is the third generation of google's > map/reduce, and it generalizes map/reduce the way Spark does. See more about > this here: http://youtu.be/wtLJPvx7-ys?t=2h37m8s Yes, my point was that Summingb

Re: Spark vs Google cloud dataflow

2014-06-27 Thread Martin Goodson
My experience is that gaining 20 spot instances accounts for a tiny fraction of the total time of provisioning a cluster with spark-ec2. This is not (solely) an AWS issue. -- Martin Goodson | VP Data Science (0)20 3397 1240 On Thu, Jun 26, 2014 at 10:14 PM, Nicholas C

Re: Spark vs Google cloud dataflow

2014-06-26 Thread Nicholas Chammas
Hmm, I remember a discussion on here about how the way in which spark-ec2 rsyncs stuff to the cluster for setup could be improved, and I’m assuming there are other such improvements to be made. Perhaps those improvements don’t matter much when compared to EC2 instance launch times, but I’m not sure

Re: Spark vs Google cloud dataflow

2014-06-26 Thread Aureliano Buendia
On Thu, Jun 26, 2014 at 9:42 PM, Nicholas Chammas wrote: > > That’s technically true, but I’d be surprised if there wasn’t a lot of > room for improvement in spark-ec2 regarding cluster launch+config times. > Unfortunately, this is a spark support issue, but an AWS on

Re: Spark vs Google cloud dataflow

2014-06-26 Thread Nicholas Chammas
On Thu, Jun 26, 2014 at 2:26 PM, Michael Bach Bui wrote: > The overhead of bringing up AWS Spark spot instances is NOT the > inherent problem of Spark. That’s technically true, but I’d be surprised if there wasn’t a lot of room for improvement in spark-ec2 regarding cluster launch+config times.

Re: Spark vs Google cloud dataflow

2014-06-26 Thread Michael Bach Bui
"The current problem with Spark is the big overhead and cost of bringing up a cluster. On a good day, it takes AWS spot instances 15 - 20 minutes to bring up a 30 node cluster. This makes it non-efficient for computations which may take only 10 - 15 minutes." Hmm, this is a misleading message. The
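
To put the quoted figures in perspective, here is a rough back-of-the-envelope sketch. It only uses the numbers quoted above (15 - 20 minutes to bring up the cluster, 10 - 15 minutes of computation); the function name `provisioning_share` is made up for illustration, and the sketch says nothing about where the startup time actually goes (EC2 launch vs. spark-ec2 setup).

```python
def provisioning_share(startup_min: float, compute_min: float) -> float:
    """Fraction of total wall-clock time spent bringing the cluster up."""
    total = startup_min + compute_min
    if total <= 0:
        raise ValueError("total time must be positive")
    return startup_min / total


# Figures quoted in the thread: 15 - 20 min startup, 10 - 15 min computation.
best = provisioning_share(15, 15)   # fastest startup, longest job
worst = provisioning_share(20, 10)  # slowest startup, shortest job

print(f"{best:.0%} to {worst:.0%} of wall-clock time is provisioning")
```

Under those figures, at least half of the wall-clock time goes to provisioning rather than computation, whichever layer turns out to be responsible for it.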

Re: Spark vs Google cloud dataflow

2014-06-26 Thread Nicholas Chammas
On Thu, Jun 26, 2014 at 10:15 AM, Aureliano Buendia wrote: > On a good day, it takes AWS spot instances 15 - 20 minutes to bring up a > 30 node cluster. This makes it non-efficient for computations which may > take only 10 - 15 minutes. I feel like there should be an issue or something to track

Re: Spark vs Google cloud dataflow

2014-06-26 Thread Aureliano Buendia
On Thu, Jun 26, 2014 at 10:58 AM, Sean Owen wrote: > My first reaction was that Dataflow mapped more to Summingbird, as part > Summingbird is for map/reduce. Dataflow is the third generation of google's map/reduce, and it generalizes map/reduce the way Spark does. See more about this here: http:

Re: Spark vs Google cloud dataflow

2014-06-26 Thread Sean Owen
Dataflow is a hosted service and tries to abstract an entire pipeline; Spark maps to some components in that pipeline and is software. My first reaction was that Dataflow mapped more to Summingbird, as part of it is a higher-level system for doing a specific thing in batch/streaming -- aggregations
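
As a rough illustration of the "same aggregation in batch and streaming" idea mentioned above, here is a minimal plain-Python sketch. It is not Summingbird's or Dataflow's API; the names `to_counts`, `merge`, `run_batch` and `run_streaming` are hypothetical. The only assumption is that the aggregation is expressed as an associative merge, so the same logic can be applied to a finite collection or to events as they arrive.

```python
from collections import Counter
from typing import Iterable, Iterator


def to_counts(event: str) -> Counter:
    """Map one event (a line of text) to partial word counts."""
    return Counter(event.split())


def merge(a: Counter, b: Counter) -> Counter:
    """Associative merge of two partial results (returns a new Counter)."""
    return a + b


def run_batch(events: Iterable[str]) -> Counter:
    """Batch mode: consume the whole input, return one final result."""
    total: Counter = Counter()
    for event in events:
        total = merge(total, to_counts(event))
    return total


def run_streaming(events: Iterable[str]) -> Iterator[Counter]:
    """Streaming mode: emit an updated running result after every event."""
    total: Counter = Counter()
    for event in events:
        total = merge(total, to_counts(event))
        yield total


events = ["spark dataflow", "spark storm"]
batch_result = run_batch(events)
stream_results = list(run_streaming(events))
```

The last value emitted in streaming mode matches the batch result, because both modes reuse the same `to_counts` and `merge` logic; only the driver loop differs.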

Spark vs Google cloud dataflow

2014-06-25 Thread Aureliano Buendia
Hi, Today Google announced Cloud Dataflow, which is very similar to Spark in that it performs both batch processing and stream processing. How does Spark compare to Google Cloud Dataflow? Are the two solutions aimed at the same problem?