Dataflow is based on two papers: MillWheel for stream processing and
FlumeJava for programming optimization and abstraction.
MillWheel http://research.google.com/pubs/pub41378.html
FlumeJava http://dl.acm.org/citation.cfm?id=1806638
Here is my blog entry on this
http://texploration.wordpress.com/
Sorry. Never mind... I guess that's what "Summingbird" is all about. Never
heard of it.
> On Jun 27, 2014, at 7:10 PM, Marco Shaw wrote:
>
> Dean: Some interesting information... Do you know where I can read more about
> these coming changes to Scalding/Cascading?
>
>> On Jun 27, 2014, at 9
Dean: Some interesting information... Do you know where I can read more about
these coming changes to Scalding/Cascading?
> On Jun 27, 2014, at 9:40 AM, Dean Wampler wrote:
>
> ... and to be clear on the point, Summingbird is not limited to MapReduce. It
> abstracts over Scalding (which abstra
... and to be clear on the point, Summingbird is not limited to MapReduce.
It abstracts over Scalding (which abstracts over Cascading, which is being
moved from MR to Spark) and over Storm for event processing.
On Fri, Jun 27, 2014 at 7:16 AM, Sean Owen wrote:
> On Thu, Jun 26, 2014 at 9:15 AM,
On Thu, Jun 26, 2014 at 9:15 AM, Aureliano Buendia wrote:
> Summingbird is for map/reduce. Dataflow is the third generation of Google's
> map/reduce, and it generalizes map/reduce the way Spark does. See more about
> this here: http://youtu.be/wtLJPvx7-ys?t=2h37m8s
Yes, my point was that Summingb
My experience is that gaining 20 spot instances accounts for a tiny
fraction of the total time of provisioning a cluster with spark-ec2. This
is not (solely) an AWS issue.
--
Martin Goodson | VP Data Science
(0)20 3397 1240
On Thu, Jun 26, 2014 at 10:14 PM, Nicholas C
Hmm, I remember a discussion on here about how the way in which spark-ec2
rsyncs stuff to the cluster for setup could be improved, and I’m assuming
there are other such improvements to be made. Perhaps those improvements
don’t matter much when compared to EC2 instance launch times, but I’m not
sure
On Thu, Jun 26, 2014 at 9:42 PM, Nicholas Chammas <
nicholas.cham...@gmail.com> wrote:
>
> That’s technically true, but I’d be surprised if there wasn’t a lot of
> room for improvement in spark-ec2 regarding cluster launch+config times.
>
Unfortunately, this is not a Spark support issue, but an AWS on
On Thu, Jun 26, 2014 at 2:26 PM, Michael Bach Bui
wrote:
> The overhead of bringing up AWS Spark spot instances is NOT the
> inherent problem of Spark.
That’s technically true, but I’d be surprised if there wasn’t a lot of
room for improvement in spark-ec2 regarding cluster launch+config times.
"The current problem with Spark is the big overhead and cost of bringing up
a cluster. On a good day, it takes AWS spot instances 15 - 20 minutes to
bring up a 30 node cluster. This makes it non-efficient for computations
which may take only 10 - 15 minutes."
Hmm, this is a misleading message. The
On Thu, Jun 26, 2014 at 10:15 AM, Aureliano Buendia
wrote:
> On a good day, it takes AWS spot instances 15 - 20 minutes to bring up a
> 30 node cluster. This makes it non-efficient for computations which may
> take only 10 - 15 minutes.
I feel like there should be an issue or something to track
On Thu, Jun 26, 2014 at 10:58 AM, Sean Owen wrote:
> My first reaction was that Dataflow mapped more to Summingbird, as part
>
Summingbird is for map/reduce. Dataflow is the third generation of Google's
map/reduce, and it generalizes map/reduce the way Spark does. See more
about this here: http:
Dataflow is a hosted service and tries to abstract an entire pipeline;
Spark maps to some components in that pipeline and is software. My
first reaction was that Dataflow mapped more to Summingbird, as part
of it is a higher-level system for doing a specific thing in
batch/streaming -- aggregations
Hi,
Today Google announced Cloud Dataflow, which is very similar to Spark
in that it performs both batch processing and stream processing.
How does Spark compare to Google Cloud Dataflow? Are they solutions
aimed at the same problem?