Re: high minimum query latency

2014-06-29 Thread Toby Douglass
(Spark here is using S3).

high minimum query latency

2014-06-29 Thread Toby Douglass
Gents, I've been benchmarking Presto, Spark, Impala and Redshift. I've been looking most recently at minimum query latency. In all cases, the cluster consists of eight m1.large EC2 instances. The minimal data set is a single 3.5 MB gzipped file. With Presto (backed by S3), I see 1 to 2 second...
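
A minimal latency probe along these lines might look like the following PySpark sketch (the bucket path and the trivial count() standing in for the benchmark query are illustrative assumptions; real numbers depend on S3 credentials and cluster warm-up):

    import time
    from pyspark import SparkContext

    sc = SparkContext(appName="latency-probe")

    # Small gzipped file on S3 (path is illustrative).
    rdd = sc.textFile("s3n://example-bucket/sample-3_5mb.gz")

    # Repeat a trivial action to expose the per-query latency floor.
    for i in range(5):
        start = time.time()
        n = rdd.count()
        print("run %d: %d lines in %.2fs" % (i, n, time.time() - start))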

Re: Shark vs Impala

2014-06-23 Thread Toby Douglass
On Mon, Jun 23, 2014 at 8:50 AM, Aaron Davidson wrote: > Note that regarding a "long load time", data format means a whole lot in terms of query performance. If you load all your data into compressed, columnar Parquet files on local hardware, Spark SQL would also perform far, far better tha...
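
For context, converting raw input into Parquet looked roughly like this at the time; a sketch assuming the Spark 1.0-era PySpark SQL API, with paths and field layout purely illustrative:

    from pyspark import SparkContext
    from pyspark.sql import SQLContext

    sc = SparkContext(appName="to-parquet")
    sqlCtx = SQLContext(sc)

    # Parse the raw gzipped text (field layout is an assumption) into dicts.
    raw = sc.textFile("s3n://example-bucket/logs/2014/06/*.gz")
    records = raw.map(lambda line: line.split(",")) \
                 .map(lambda f: {"ts": f[0], "user": f[1], "bytes": int(f[2])})

    # Infer a schema and write a compressed, columnar Parquet copy
    # that later queries can scan far more cheaply.
    schema_rdd = sqlCtx.inferSchema(records)
    schema_rdd.saveAsParquetFile("s3n://example-bucket/logs-parquet/2014/06")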

Re: Shark vs Impala

2014-06-23 Thread Toby Douglass
On Sun, Jun 22, 2014 at 5:53 PM, Debasish Das wrote: > 600s for Spark vs 5s for Redshift... The numbers look much different from the amplab benchmark... https://amplab.cs.berkeley.edu/benchmark/ Is it like SSDs or something that's helping Redshift, or the whole data is in memory when yo...

Re: Shark vs Impala

2014-06-22 Thread Toby Douglass
I've just benchmarked Spark and Impala. Same data (in S3), same query, same cluster. Impala has a long load time, since it cannot load directly from S3. I have to create a Hive table on S3, then insert from that into an Impala table. This takes a long time; Spark took about 600s for the query, Imp...

Re: spark EC2 bring-up problems

2014-06-12 Thread Toby Douglass
On Thu, Jun 12, 2014 at 9:10 PM, Zongheng Yang wrote: > Hi Toby, it is usually the case that even if the EC2 console says the nodes are up, they are not really fully initialized. For 16 nodes I have found `--wait 800` to be the norm that makes things work.
It seems so! resume worked f...

Re: spark EC2 bring-up problems

2014-06-12 Thread Toby Douglass
On Thu, Jun 12, 2014 at 8:50 PM, Nicholas Chammas <nicholas.cham...@gmail.com> wrote: > Yes, you need Python 2.7 to run spark-ec2 and most AMIs come with 2.6
Ah, yes - I mean to say, Amazon Linux.
> Have you tried either: 1. Retrying launch with the --resume option? 2. Increasing...

spark EC2 bring-up problems

2014-06-12 Thread Toby Douglass
Gents, I have been bringing up a cluster on EC2 using the spark_ec2.py script. This works if the cluster has a single slave. It fails if the cluster has sixteen slaves, during the step that transfers the SSH key to the slaves. I cannot currently bring up a large cluster. Can anyone shed any lig...

Re: initial basic question from new user

2014-06-12 Thread Toby Douglass
On Thu, Jun 12, 2014 at 4:48 PM, Andre Schumacher <schum...@icsi.berkeley.edu> wrote: > On 06/12/2014 05:47 PM, Toby Douglass wrote: > > In these future jobs, when I come to load the aggregated RDD, will Spark load, and only load, the columns being accessed by the...
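
The column-pruning question can be probed with a small Spark SQL query; a sketch against the 1.0-era Python API, with table, column names and path being illustrative assumptions:

    from pyspark import SparkContext
    from pyspark.sql import SQLContext

    sc = SparkContext(appName="column-pruning")
    sqlCtx = SQLContext(sc)

    # Load the previously saved Parquet aggregate (path is illustrative).
    agg = sqlCtx.parquetFile("s3n://example-bucket/aggregates/daily")
    agg.registerAsTable("daily_agg")

    # A query touching only two columns should only read those columns'
    # chunks from storage, thanks to Parquet's columnar layout.
    top = sqlCtx.sql(
        "SELECT user, total_bytes FROM daily_agg "
        "ORDER BY total_bytes DESC LIMIT 10")
    for row in top.collect():
        print(row)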

Re: initial basic question from new user

2014-06-12 Thread Toby Douglass
On Thu, Jun 12, 2014 at 11:36 AM, Gerard Maas wrote: > If you need to ad-hoc persist to files, you can save RDDs using rdd.saveAsObjectFile(...) [1] and load them afterwards using sparkContext.objectFile(...)
Appears not available from Python.
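
A note on the Python gap: saveAsObjectFile/objectFile are JVM-side (Java serialization), but a later PySpark release (1.1, after this thread) added a rough counterpart, saveAsPickleFile/pickleFile. A sketch, with paths and contents illustrative:

    from pyspark import SparkContext

    sc = SparkContext(appName="pickle-persist")

    # Some aggregate RDD produced earlier in the job (contents illustrative).
    agg = sc.parallelize([("user-1", 120), ("user-2", 340)])

    # Write it out so a later job can reload it without recomputation.
    agg.saveAsPickleFile("s3n://example-bucket/aggregates/pickled")

    # In the later job:
    reloaded = sc.pickleFile("s3n://example-bucket/aggregates/pickled")
    print(reloaded.collect())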

Re: initial basic question from new user

2014-06-12 Thread Toby Douglass
On Thu, Jun 12, 2014 at 3:15 PM, FRANK AUSTIN NOTHAFT wrote: > RE: > > Given that our agg sizes will exceed memory, we expect to cache them to disk, so save-as-object (assuming there are no out of the ordinary performance issues) may solve the problem, but I was hoping to store data in a...
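
On the "cache them to disk" point, a sketch of the within-job option: persist() accepts a storage level that spills partitions which do not fit in memory to local disk. Paths, field layout and the choice of level are illustrative assumptions:

    from pyspark import SparkContext, StorageLevel

    sc = SparkContext(appName="spill-to-disk")

    raw = sc.textFile("s3n://example-bucket/logs/2014/06/*.gz")
    agg = raw.map(lambda line: (line.split(",")[1], 1)) \
             .reduceByKey(lambda a, b: a + b)

    # Keep what fits in memory, spill the rest to local disk for the
    # duration of this job (DISK_ONLY is the all-disk alternative).
    agg.persist(StorageLevel.MEMORY_AND_DISK)

    print(agg.count())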

Re: initial basic question from new user

2014-06-12 Thread Toby Douglass
On Thu, Jun 12, 2014 at 3:03 PM, Christopher Nguyen wrote: > Toby, #saveAsTextFile() and #saveAsObjectFile() are probably what you want for your use case.
Yes. Thank you. I'm about to see if they exist for Python.
> As for Parquet support, that's newly arrived in Spark 1.0.0 together with...

Re: initial basic question from new user

2014-06-12 Thread Toby Douglass
On Thu, Jun 12, 2014 at 11:36 AM, Gerard Maas wrote: > The goal of rdd.persist is to create a cached rdd that breaks the DAG lineage. Therefore, computations *in the same job* that use that RDD can re-use that intermediate result, but it's not meant to survive between job runs.
As I und...
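
A sketch of that distinction (names and paths illustrative): within one job the persisted RDD is reused by every action, but a fresh job run starts from nothing and must either recompute it or reload it from files written out explicitly:

    from pyspark import SparkContext

    sc = SparkContext(appName="persist-within-job")

    logs = sc.textFile("s3n://example-bucket/logs/2014/06/12/*.gz")
    per_user = logs.map(lambda line: (line.split(",")[1], 1)) \
                   .reduceByKey(lambda a, b: a + b)

    # Keep the intermediate result around for the rest of this job.
    per_user.cache()

    # Both actions below reuse the cached result within this job...
    print(per_user.count())
    print(per_user.take(5))

    # ...but nothing here outlives the job: a later run must recompute
    # the RDD, or reload it from files saved explicitly (text, pickle
    # or Parquet output, as discussed elsewhere in this thread).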

initial basic question from new user

2014-06-12 Thread Toby Douglass
Gents, I am investigating Spark with a view to performing reporting on a large data set, where the data set receives additional data in the form of log files on an hourly basis. Since the data set is large, there is a possibility we will create a range of aggregate tables to reduce the volume...
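
The pipeline being described could be sketched roughly as follows; the paths, the comma-separated log layout and the per-user rollup are all illustrative assumptions, not the poster's actual job:

    from pyspark import SparkContext

    sc = SparkContext(appName="hourly-rollup")

    # One day of hourly log files (path layout is an assumption).
    logs = sc.textFile("s3n://example-bucket/logs/2014/06/12/*/*.gz")

    # Roll the raw lines up into a much smaller per-user aggregate.
    daily = logs.map(lambda line: line.split(",")) \
                .map(lambda f: (f[1], int(f[2]))) \
                .reduceByKey(lambda a, b: a + b)

    # Store the aggregate table; reporting jobs query this instead of
    # re-scanning the raw logs.
    daily.map(lambda kv: "%s,%d" % kv).saveAsTextFile(
        "s3n://example-bucket/aggregates/2014/06/12")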