Re: Spark vs Google cloud dataflow

Aureliano Buendia Thu, 26 Jun 2014 07:16:40 -0700

On Thu, Jun 26, 2014 at 10:58 AM, Sean Owen <[email protected]> wrote:

> My first reaction was that Dataflow mapped more to Summingbird, as part
>

Summingbird is for map/reduce. Dataflow is the third generation of Google's
map/reduce, and it generalizes map/reduce the way Spark does. There is more
about this here: http://youtu.be/wtLJPvx7-ys?t=2h37m8s

It seems Dataflow is based on this paper:
http://pages.cs.wisc.edu/~akella/CS838/F12/838-CloudPapers/FlumeJava.pdf

The paper mentions in-memory computation a few times, but I'm not sure how
closely Google's implementation resembles Spark when it comes to in-memory
computation.

The current problem with Spark is the large overhead and cost of bringing up
a cluster. On a good day, it takes 15-20 minutes to bring up a 30-node
cluster on AWS spot instances. This makes it inefficient for computations
that may take only 10-15 minutes.
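
To put rough numbers on that, here is a back-of-the-envelope sketch using
only the figures above. It assumes startup and job time simply add up with
no overlap; the function and variable names are made up for illustration.

```python
# Figures from the message above (assumed to add up with no overlap).
STARTUP_MIN = (15, 20)  # minutes to bring up the 30-node cluster
JOB_MIN = (10, 15)      # minutes of actual computation


def startup_share(startup: float, job: float) -> float:
    """Fraction of total wall-clock time spent waiting for the cluster."""
    return startup / (startup + job)


# Best case: fastest startup with the longest job.
best = startup_share(STARTUP_MIN[0], JOB_MIN[1])
# Worst case: slowest startup with the shortest job.
worst = startup_share(STARTUP_MIN[1], JOB_MIN[0])

print(f"startup share of wall-clock time: {best:.0%} to {worst:.0%}")
# prints "startup share of wall-clock time: 50% to 67%"
```

So under these assumptions, somewhere between half and two thirds of the
elapsed time goes to waiting for the cluster rather than to the computation.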

> of it is a higher-level system for doing a specific thing in
> batch/streaming -- aggregations.
>
> On Wed, Jun 25, 2014 at 8:23 PM, Aureliano Buendia <[email protected]>
> wrote:
> > Hi,
> >
> > Today Google announced their cloud dataflow, which is very similar to
> > spark in performing batch processing and stream processing.
> >
> > How does spark compare to Google cloud dataflow? Are they solutions
> > trying to address the same problem?
> >
> >
>
