I meant using |saveAsParquetFile|. As for the number of partitions, you
can always control it with the |spark.sql.shuffle.partitions| property.
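A minimal sketch of what this could look like (Spark 1.2-era API; the SQLContext setup and the output path are assumptions, not from the thread):

```scala
import org.apache.spark.sql.SQLContext

// Assumes an existing SparkContext `sc` and a computed SchemaRDD
// `processedSchemaRdd`; the HDFS path is hypothetical.
val sqlContext = new SQLContext(sc)

// Control how many partitions shuffles (joins, aggregations) produce
// before the result is written out.
sqlContext.setConf("spark.sql.shuffle.partitions", "200")

// Persist the result table to HDFS in Parquet format.
processedSchemaRdd.saveAsParquetFile("hdfs:///tmp/processed.parquet")
```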
Cheng
On 2/23/15 1:38 PM, nitin wrote:
I believe calling processedSchemaRdd.persist(DISK) and
processedSchemaRdd.checkpoint() only persists the data; I will lose all the
RDD metadata, and when I restart my driver, that data is essentially useless
to me (correct me if I am wrong).
I thought of doing processedSchemaRdd.saveAsParquetFile (hdf
How about persisting the computed result table before caching it? That
way, after restarting your service, you only need to re-cache the result
table rather than recompute it. Somewhat like checkpointing.
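Concretely, that could look like the following sketch (Spark 1.2-era API; the table name and path are assumptions):

```scala
// First run: compute the result table and persist it as Parquet.
// `processedSchemaRdd` and `sqlContext` are assumed to exist already.
processedSchemaRdd.saveAsParquetFile("hdfs:///tmp/result.parquet")

// After a driver restart: reload the persisted table and cache it,
// avoiding the original computation entirely.
val restored = sqlContext.parquetFile("hdfs:///tmp/result.parquet")
restored.registerTempTable("result")
sqlContext.cacheTable("result")
```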
Cheng
On 2/22/15 12:55 AM, nitin wrote:
Hi All,
I intend to build a long running spark ap