Good point. Shows how personal use cases color how we interpret products.
On Wed, Jul 9, 2014 at 1:08 AM, Sean Owen wrote:
> On Wed, Jul 9, 2014 at 1:52 AM, Keith Simmons wrote:
>
>> Impala is *not* built on map/reduce, though it was built to replace
>> Hive, which is map/reduce based. It h
On Wed, Jul 9, 2014 at 1:52 AM, Keith Simmons wrote:
> Impala is *not* built on map/reduce, though it was built to replace Hive,
> which is map/reduce based. It has its own distributed query engine, though
> it does load data from HDFS, and is part of the Hadoop ecosystem. Impala
> really shin
Santosh,
To add a bit more to what Nabeel said, Spark and Impala are very different
tools. Impala is *not* built on map/reduce, though it was built to replace
Hive, which is map/reduce based. It has its own distributed query engine,
though it does load data from HDFS, and is part of the Hadoop e
As a new user, I can definitely say that my experience with Spark has
been rather raw. The appeal of interactive, batch, and everything in
between, all using more or less straight Scala, is unarguable. But the
experience of deploying Spark has been quite painful, mainly because of
gaps between compile time and run
Aaron,
I don't think anyone was saying Spark can't handle this data size, given
testimony from the Spark team, Bizo, etc., on large datasets. This has kept
us trying different things to get our flow to work over the course of
several weeks.
Agreed that the first instinct should be "what did I do
>
> Not sure exactly what is happening but perhaps there are ways to
> restructure your program for it to work better. Spark is definitely able to
> handle much, much larger workloads.
+1 @Reynold
Spark can handle big "big data". There are known issues with informing the
user about what went wro
I think we're missing the point a bit. Everything was actually flowing
through smoothly and in a reasonable time, until it reached the last two
tasks (out of over a thousand in the final stage alone), at which point it
just fell into a coma. Not so much as a cranky message in the logs.
I don't kno
Not sure exactly what is happening but perhaps there are ways to
restructure your program for it to work better. Spark is definitely able to
handle much, much larger workloads.
I've personally run a workload that shuffled 300 TB of data. I've also run
something that shuffled 5TB/node and stuffed m
On Tue, Jul 8, 2014 at 8:32 PM, Surendranauth Hiraman <
suren.hira...@velos.io> wrote:
>
> Libraries like Scoobi, Scrunch and Scalding (and their associated Java
> versions) provide a Spark-like wrapper around Map/Reduce but my guess is
> that, since they are limited to Map/Reduce under the covers,
We kind of hijacked Santosh's original thread, so apologies for that, and
let me try to get back to Santosh's original question on Map/Reduce versus
Spark. I would say it's worth migrating from M/R, with the following
thoughts. Just my opinion, but I would summarize the latest emails in this
thread as
Also, our exact same flow but with 1 GB of input data completed fine.
-Suren
On Tue, Jul 8, 2014 at 2:16 PM, Surendranauth Hiraman <
suren.hira...@velos.io> wrote:
> How wide are the rows of data, either the raw input data or any generated
> intermediate data?
>
> We are at a loss as to why our
How wide are the rows of data, either the raw input data or any generated
intermediate data?
We are at a loss as to why our flow doesn't complete. We banged our heads
against it for a few weeks.
-Suren
On Tue, Jul 8, 2014 at 2:12 PM, Kevin Markey wrote:
> Nothing particularly custom. We've
Nothing particularly custom. We've tested with small (4-node)
development clusters, single-node pseudo-clusters, and bigger ones, using
plain-vanilla Hadoop 2.2 or 2.3 or CDH5 (beta and beyond), in Spark
master, Spark local, and Spark YARN (client and cluster) modes, with
total me
To clarify, we are not persisting to disk. That was just one of the
experiments we did because of some issues we had along the way.
At this time, we are NOT using persist but cannot get the flow to complete
in Standalone Cluster mode. We do not have a YARN-capable cluster at this
time.
We agree w
It seems to me that you're not taking full advantage of the lazy
evaluation, especially if you're persisting to disk only. While it
might be true that the cumulative size of the RDDs looks like it's
300 GB, only a small portion of that should be resident at any one time.
We've eva
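A minimal sketch of that difference in Scala, assuming sc is an existing
SparkContext and the input path is only a stand-in:

    import org.apache.spark.storage.StorageLevel

    // DISK_ONLY materializes every partition to disk; MEMORY_AND_DISK keeps
    // partitions in memory and spills only the ones that don't fit, so far
    // less than the cumulative RDD size needs to be resident at once.
    val parsed = sc.textFile("hdfs:///data/input").map(_.split("\t"))
    parsed.persist(StorageLevel.MEMORY_AND_DISK)  // rather than StorageLevel.DISK_ONLY
    parsed.count()  // nothing is computed or cached until an action runs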
I believe our full 60 days of data contains over ten million unique
entities. Across 10 days I'm not sure, but it should be in the millions. I
haven't verified that myself though. So that's the scale of the RDD we're
writing to disk (each entry is entityId -> profile).
I think it's hard to know ho
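A rough sketch of the shape being described, with a hypothetical Profile
case class and sc again assumed to be an existing SparkContext:

    // Hypothetical stand-in for the real profile type used in the flow.
    case class Profile(entityId: Long, eventCount: Long)

    // One (entityId -> profile) entry per entity, written out to disk.
    val events = sc.parallelize(Seq((1L, 1L), (2L, 1L), (1L, 1L)))  // (entityId, 1)
    val profiles = events.reduceByKey(_ + _)
                         .map { case (id, n) => (id, Profile(id, n)) }
    profiles.saveAsObjectFile("/tmp/profiles")  // millions of entries at full scale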
I'll respond for Dan.
Our test dataset was a total of 10 GB of input data (the full production
dataset for this particular dataflow would be roughly 60 GB).
I'm not sure what the size of the final output data was, but I think it
was on the order of 20 GB for the given 10 GB of input data. Also, I can
When you say "large data sets", how large?
Thanks
On 07/07/2014 01:39 PM, Daniel Siegmann wrote:
From a development perspective, I vastly prefer Spark to
MapReduce. The MapReduce API is very constrained; Spark's
I don't have those numbers off-hand, though the shuffle spill to disk was
coming to several gigabytes per node, if I recall correctly.
The MapReduce pipeline takes about 2-3 hours, I think, for the full 60-day
data set. Spark chugs along fine for a while and then hangs. We restructured
the flow a few
In addition to Scalding and Scrunch, there is Scoobi. Unlike the others, it
is Scala-only (it doesn't wrap a Java framework). All three have fairly
similar APIs and aren't too different from Spark. For example, instead of an
RDD you have a DList (distributed list) or a PCollection (parallel
collection) -
Daniel,
Do you mind sharing the size of your cluster and the production data volumes?
Thanks
Soumya
> On Jul 7, 2014, at 3:39 PM, Daniel Siegmann wrote:
>
> From a development perspective, I vastly prefer Spark to MapReduce. The
> MapReduce API is very constrained; Spark's API feels muc
On Tue, Jul 8, 2014 at 1:05 AM, Nabeel Memon wrote:
> For a Scala API on map/reduce (the Hadoop engine), there's a library called
> "Scalding". It's built on top of Cascading. If you have a huge dataset, or if
> you are considering using the map/reduce engine for your job for any reason,
> you can try Scalding.
>
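For reference, a word count in Scalding's fields-based API looks roughly like
the sketch below (adapted from Scalding's documentation, untested here; the
input and output arguments are placeholders):

    import com.twitter.scalding._

    class WordCountJob(args: Args) extends Job(args) {
      // Read lines, split into words, count each word, write TSV output.
      TextLine(args("input"))
        .flatMap('line -> 'word) { line: String => line.split("\\s+") }
        .groupBy('word) { _.size }
        .write(Tsv(args("output")))
    }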
> Santosh Karthikeyan
>
>
>
> *From:* Daniel Siegmann [mailto:daniel.siegm...@velos.io]
> *Sent:* Tuesday, July 08, 2014 1:10 AM
> *To:* user@spark.apache.org
> *Subject:* Re: Comparative study
>
>
>
> From a development perspective, I vastly prefer Spark to MapReduce. The
> MapRedu
Thanks Daniel for sharing this info.
Regards,
Santosh Karthikeyan
From: Daniel Siegmann [mailto:daniel.siegm...@velos.io]
Sent: Tuesday, July 08, 2014 1:10 AM
To: user@spark.apache.org
Subject: Re: Comparative study
From a development perspective, I vastly prefer Spark to MapReduce. The
From a development perspective, I vastly prefer Spark to MapReduce. The
MapReduce API is very constrained; Spark's API feels much more natural to
me. Testing and local development is also very easy - creating a local
Spark context is trivial and it reads local files. For your unit tests you
can ju
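As a rough illustration of that point (the names and data here are made up),
a unit test can spin up a local-mode context and tear it down with no cluster
involved:

    import org.apache.spark.{SparkConf, SparkContext}

    // Local mode: runs in-process on 2 threads, no cluster required.
    val conf = new SparkConf().setMaster("local[2]").setAppName("unit-test")
    val sc = new SparkContext(conf)
    try {
      // A small in-memory dataset stands in for a local test file.
      val counts = sc.parallelize(Seq("a b", "b c")).flatMap(_.split(" ")).countByValue()
      assert(counts("b") == 2)
    } finally {
      sc.stop()  // always stop the context so tests don't leak resources
    }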