On Thu, Jun 26, 2014 at 10:58 AM, Sean Owen <so...@cloudera.com> wrote:
> My first reaction was that Dataflow mapped more to Summingbird, as part
> of it is a higher-level system for doing a specific thing in
> batch/streaming -- aggregations.

Summingbird is for map/reduce. Dataflow is the third generation of Google's
map/reduce, and it generalizes map/reduce the way Spark does. See more about
this here:

http://youtu.be/wtLJPvx7-ys?t=2h37m8s

It seems Dataflow is based on this paper:

http://pages.cs.wisc.edu/~akella/CS838/F12/838-CloudPapers/FlumeJava.pdf

The paper mentions in-memory computation a few times, but I'm not sure how
closely Google's implementation resembles Spark when it comes to in-memory
computation.

The current problem with Spark is the large overhead and cost of bringing up
a cluster. On a good day, it takes 15 - 20 minutes to bring up a 30-node
cluster on AWS spot instances. This makes it inefficient for computations
that may take only 10 - 15 minutes.

> On Wed, Jun 25, 2014 at 8:23 PM, Aureliano Buendia <buendia...@gmail.com>
> wrote:
> > Hi,
> >
> > Today Google announced their cloud dataflow, which is very similar to spark
> > in performing batch processing and stream processing.
> >
> > How does spark compare to Google cloud dataflow? Are they solutions trying
> > to aim the same problem?