Re: initial basic question from new user

2014-06-12 Thread Toby Douglass
On Thu, Jun 12, 2014 at 4:48 PM, Andre Schumacher <schum...@icsi.berkeley.edu> wrote: > On 06/12/2014 05:47 PM, Toby Douglass wrote: > > In these future jobs, when I come to load the aggregated RDD, will Spark > > load and only load the columns being accessed by the query? or will Spark > > l

Re: initial basic question from new user

2014-06-12 Thread Andre Schumacher
Hi, On 06/12/2014 05:47 PM, Toby Douglass wrote: > In these future jobs, when I come to load the aggregated RDD, will Spark > load and only load the columns being accessed by the query? or will Spark > load everything, to convert it into an internal representation, and then > execute the query?
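A minimal sketch of the column-pruning behaviour in question, using the Spark 1.0 SQLContext API; the path, table name and column names are purely illustrative. Because Parquet is column-oriented, a query that projects only some columns should only need to read those columns from disk:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Sketch only: assumes an existing Parquet file at a hypothetical path.
val sc = new SparkContext(new SparkConf().setAppName("parquet-pruning"))
val sqlContext = new SQLContext(sc)

// Load the Parquet file; the schema is read from the file's own metadata.
val aggregated = sqlContext.parquetFile("hdfs:///aggregates/hourly.parquet")
aggregated.registerAsTable("hourly_agg")

// Only the columns named in the query ("host", "bytes") have to be read,
// since the projection is pushed down to the Parquet reader.
val perHost = sqlContext.sql("SELECT host, SUM(bytes) FROM hourly_agg GROUP BY host")
perHost.collect().foreach(println)
```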

Re: initial basic question from new user

2014-06-12 Thread Toby Douglass
On Thu, Jun 12, 2014 at 11:36 AM, Gerard Maas wrote: > If you need to ad-hoc persist to files, you can save RDDs using > rdd.saveAsObjectFile(...) [1] and load them afterwards using > sparkContext.objectFile(...) > Appears not available from Python.
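For reference, the Scala API Gerard refers to looks roughly like this; the paths and the key extraction are hypothetical. saveAsObjectFile writes the RDD as SequenceFiles of serialized objects, and objectFile reads them back in a later job:

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("persist-to-files"))

// Build an aggregate and write it out for later jobs (hypothetical path).
val agg = sc.textFile("hdfs:///logs/2014-06-12/*.log")
  .map(line => (line.split("\t")(0), 1L))
  .reduceByKey(_ + _)
agg.saveAsObjectFile("hdfs:///aggregates/2014-06-12")

// In a later job, load it back; the element type must be supplied.
val reloaded = sc.objectFile[(String, Long)]("hdfs:///aggregates/2014-06-12")
println(reloaded.count())
```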

Re: initial basic question from new user

2014-06-12 Thread Toby Douglass
On Thu, Jun 12, 2014 at 3:15 PM, FRANK AUSTIN NOTHAFT wrote: > RE: > > > Given that our agg sizes will exceed memory, we expect to cache them to > disk, so save-as-object (assuming there are no out of the ordinary > performance issues) may solve the problem, but I was hoping to store data > in a

Re: initial basic question from new user

2014-06-12 Thread Toby Douglass
On Thu, Jun 12, 2014 at 3:03 PM, Christopher Nguyen wrote: > Toby, #saveAsTextFile() and #saveAsObjectFile() are probably what you want > for your use case. > Yes. Thank you. I'm about to see if they exist for Python. > As for Parquet support, that's newly arrived in Spark 1.0.0 together with

Re: initial basic question from new user

2014-06-12 Thread FRANK AUSTIN NOTHAFT
RE: > Given that our agg sizes will exceed memory, we expect to cache them to disk, so save-as-object (assuming there are no out of the ordinary performance issues) may solve the problem, but I was hoping to store data in a column-orientated format. However I think this in general is not possible

Re: initial basic question from new user

2014-06-12 Thread Christopher Nguyen
Toby, #saveAsTextFile() and #saveAsObjectFile() are probably what you want for your use case. As for Parquet support, that's newly arrived in Spark 1.0.0 together with SparkSQL so continue to watch this space. Gerard's suggestion to look at JobServer, which you can generalize as "building a long-r
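As a sketch of the Parquet path that arrived with SparkSQL in Spark 1.0.0: the case class, field names and paths below are made up for illustration, but the pattern (RDD of case classes, converted implicitly to a SchemaRDD, written and re-read as Parquet) follows the 1.0 SQL programming guide:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Hypothetical record type for the aggregated data.
case class HourlyAgg(host: String, hour: String, bytes: Long)

val sc = new SparkContext(new SparkConf().setAppName("save-parquet"))
val sqlContext = new SQLContext(sc)
import sqlContext.createSchemaRDD  // implicit RDD[case class] -> SchemaRDD

val agg = sc.textFile("hdfs:///logs/2014-06-12/*.log").map { line =>
  val f = line.split("\t")
  HourlyAgg(f(0), f(1), f(2).toLong)
}

// Write the aggregate out as column-oriented Parquet, then read it back
// in a later job without recomputing it.
agg.saveAsParquetFile("hdfs:///aggregates/hourly.parquet")
val back = sqlContext.parquetFile("hdfs:///aggregates/hourly.parquet")
```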

Re: initial basic question from new user

2014-06-12 Thread Toby Douglass
On Thu, Jun 12, 2014 at 11:36 AM, Gerard Maas wrote: > The goal of rdd.persist is to create a cached rdd that breaks the DAG > lineage. Therefore, computations *in the same job* that use that RDD can > re-use that intermediate result, but it's not meant to survive between job > runs. > As I und

Re: initial basic question from new user

2014-06-12 Thread Gerard Maas
The goal of rdd.persist is to create a cached rdd that breaks the DAG lineage. Therefore, computations *in the same job* that use that RDD can re-use that intermediate result, but it's not meant to survive between job runs. for example: val baseData = rawDataRdd.map(...).flatMap(...).reduceByKey
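Filling out Gerard's example a little (the input path and the transformations are illustrative only): the persisted RDD is computed once and re-used by the actions that follow it within the same job, but the cached copy does not outlive the driver program:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

val sc = new SparkContext(new SparkConf().setAppName("persist-within-job"))

// Hypothetical input: tab-separated log lines.
val rawDataRdd = sc.textFile("hdfs:///logs/2014-06-12/*.log")

// Expensive lineage: map -> flatMap -> reduceByKey, as in Gerard's example.
val baseData = rawDataRdd
  .map(_.split("\t"))
  .flatMap(fields => if (fields.length >= 2) Some((fields(0), 1L)) else None)
  .reduceByKey(_ + _)

// Cache the intermediate result so later computations in this job re-use
// it instead of recomputing the whole lineage; spill to disk if it is too big.
baseData.persist(StorageLevel.MEMORY_AND_DISK)

val keyCount = baseData.count()                               // first re-use
val topKeys  = baseData.map(_.swap).sortByKey(false).take(10) // second re-use
// The cached copy lives only for the lifetime of this driver program.
```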

initial basic question from new user

2014-06-12 Thread Toby Douglass
Gents, I am investigating Spark with a view to perform reporting on a large data set, where the large data set receives additional data in the form of log files on an hourly basis. Where the data set is large there is a possibility we will create a range of aggregate tables, to reduce the volume
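One way the hourly-update-plus-aggregate-table pattern described above might look, as a rough sketch with entirely hypothetical paths and a simple count-per-key aggregate: each hour's logs are folded into the existing aggregate and a new version of the aggregate is written, so that reports query the much smaller aggregate rather than the raw logs:

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("hourly-aggregate"))

// Hypothetical layout: one directory of raw logs per hour, and one
// versioned aggregate table per day, rewritten when a new hour arrives.
val newHour = sc.textFile("hdfs:///logs/2014-06-12/14/*.log")
  .map(line => (line.split("\t")(0), 1L))

val existingAgg = sc.objectFile[(String, Long)]("hdfs:///aggregates/2014-06-12/v13")

// Fold the new hour into the existing aggregate and write a new version.
val updatedAgg = (existingAgg union newHour).reduceByKey(_ + _)
updatedAgg.saveAsObjectFile("hdfs:///aggregates/2014-06-12/v14")
```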