(Spark here is using s3).
Gents,
I've been benchmarking Presto, Spark, Impala and Redshift.
I've been looking most recently at minimum query latency.
In all cases, the cluster consists of eight m1.large EC2 instances.
The minimal data set is a single 3.5 MB gzipped file.
With Presto (backed by s3), I see 1 to 2 second
On Mon, Jun 23, 2014 at 8:50 AM, Aaron Davidson wrote:
> Note that regarding a "long load time", data format means a whole lot in
> terms of query performance. If you load all your data into compressed,
> columnar Parquet files on local hardware, Spark SQL would also perform far,
> far better tha
On Sun, Jun 22, 2014 at 5:53 PM, Debasish Das
wrote:
> 600s for Spark vs 5s for Redshift...The numbers look much different from
> the amplab benchmark...
>
> https://amplab.cs.berkeley.edu/benchmark/
>
> Is it like SSDs or something that's helping redshift or the whole data is
> in memory when yo
I've just benchmarked Spark and Impala. Same data (in s3), same query,
same cluster.
Impala has a long load time, since it cannot load directly from s3. I have
to create a Hive table on s3, then insert from that to an Impala table.
This takes a long time; Spark took about 600s for the query, Imp
On Thu, Jun 12, 2014 at 9:10 PM, Zongheng Yang wrote:
> Hi Toby,
>
> It is usually the case that even if the EC2 console says the nodes are
> up, they are not really fully initialized. For 16 nodes I have found
> `--wait 800` to be the norm that makes things work.
>
It seems so! resume worked f
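The advice above can be sketched as a pair of spark-ec2 invocations (a sketch based on the script's flags at the time; the key pair, identity file, and cluster name are hypothetical):

```
# Launch, giving the instances longer to initialize before setup begins:
./spark-ec2 -k my-keypair -i my-key.pem -s 16 --wait 800 launch my-cluster

# If a launch failed partway (e.g. while transferring SSH keys), retry it:
./spark-ec2 -k my-keypair -i my-key.pem --resume launch my-cluster
```

With `--resume`, the script skips re-launching instances that already exist and re-runs only the setup steps.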
On Thu, Jun 12, 2014 at 8:50 PM, Nicholas Chammas <
nicholas.cham...@gmail.com> wrote:
> Yes, you need Python 2.7 to run spark-ec2 and most AMIs come with 2.6
>
Ah, yes - I meant to say Amazon Linux.
> Have you tried either:
>
>1. Retrying launch with the --resume option?
>2. Increasing
Gents,
I have been bringing up a cluster on EC2 using the spark_ec2.py script.
This works if the cluster has a single slave.
This fails if the cluster has sixteen slaves, during the work to transfer
the SSH key to the slaves. I cannot currently bring up a large cluster.
Can anyone shed any lig
On Thu, Jun 12, 2014 at 4:48 PM, Andre Schumacher <
schum...@icsi.berkeley.edu> wrote:
> On 06/12/2014 05:47 PM, Toby Douglass wrote:
>
> > In these future jobs, when I come to load the aggregated RDD, will Spark
> > load and only load the columns being accessed by the
On Thu, Jun 12, 2014 at 11:36 AM, Gerard Maas wrote:
> If you need to ad-hoc persist to files, you can save RDDs using
> rdd.saveAsObjectFile(...) [1] and load them afterwards using
> sparkContext.objectFile(...)
>
Appears not available from Python.
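Since `objectFile`/`saveAsObjectFile` were not exposed in the Python API at the time, one workaround was to serialize records to files yourself with `pickle`. A minimal local sketch of that round trip (plain Python, no Spark; the path and helper names here are my own, not a Spark API):

```python
import os
import pickle
import tempfile

def save_records(records, path):
    # Pickle a list of records to one file (a stand-in for dumping a partition).
    with open(path, "wb") as f:
        pickle.dump(records, f)

def load_records(path):
    # Load the pickled records back.
    with open(path, "rb") as f:
        return pickle.load(f)

tmp = os.path.join(tempfile.mkdtemp(), "part-00000.pkl")
save_records([("a", 1), ("b", 2)], tmp)
print(load_records(tmp))  # → [('a', 1), ('b', 2)]
```

The same idea applied per partition (e.g. via `mapPartitions`) would let each worker write its own pickle file, which is roughly what a later PySpark `saveAsPickleFile` does for you.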
On Thu, Jun 12, 2014 at 3:15 PM, FRANK AUSTIN NOTHAFT wrote:
> RE:
>
> > Given that our agg sizes will exceed memory, we expect to cache them to
> disk, so save-as-object (assuming there are no out of the ordinary
> performance issues) may solve the problem, but I was hoping to store data
> is a
On Thu, Jun 12, 2014 at 3:03 PM, Christopher Nguyen wrote:
> Toby, #saveAsTextFile() and #saveAsObjectFile() are probably what you want
> for your use case.
>
Yes. Thank you. I'm about to see if they exist for Python.
> As for Parquet support, that's newly arrived in Spark 1.0.0 together with
On Thu, Jun 12, 2014 at 11:36 AM, Gerard Maas wrote:
> The goal of rdd.persist is to create a cached RDD that breaks the DAG
> lineage. Therefore, computations *in the same job* that use that RDD can
> re-use that intermediate result, but it's not meant to survive between job
> runs.
>
As I und
Gents,
I am investigating Spark with a view to perform reporting on a large data
set, where the large data set receives additional data in the form of log
files on an hourly basis.
Where the data set is large there is a possibility we will create a range
of aggregate tables, to reduce the volume