On Thu, Jun 12, 2014 at 11:36 AM, Gerard Maas <gerard.m...@gmail.com> wrote:
> The goal of rdd.persist is to created a cached rdd that breaks the DAG
> lineage. Therefore, computations *in the same job* that use that RDD can
> re-use that intermediate result, but it's not meant to survive between job
> runs.

As I understand it, Spark is designed for interactive querying, in the sense that caching intermediate results eliminates the need to recompute them. However, if intermediate results last only for the duration of a job (say, a Python script), how exactly is interactive querying actually performed? A script is not an interactive medium. Is the shell the only medium for interactive querying?

Consider a common use case: a web site which offers reporting over a large data set. Users issue arbitrary queries, but a few queries (differing only in their arguments) dominate the query load, so we thought to create intermediate RDDs to service those queries, so that only those much smaller RDDs would need to be processed for the bulk of the load.

Where this is not possible, we can only use Spark for reporting by issuing each query over the whole data set - i.e. Spark is just like Impala, is just like Presto, is just like [nnn]. The enormous benefit of RDDs - the entire point of Spark, and so profoundly useful here - is not available. What a huge and unexpected loss! Spark seemingly renders itself ordinary. It is for this reason I am surprised to find this functionality is not available.

> If you need to ad-hoc persist to files, you can can save RDDs using
> rdd.saveAsObjectFile(...) [1] and load them afterwards using
> sparkContext.objectFile(...)

I've been using this site for docs:

http://spark.apache.org

Here, the top-of-the-page menu link "API Docs" -> "Python API" brings us to:

http://spark.apache.org/docs/latest/api/python/pyspark.rdd.RDD-class.html

This page does not show the function saveAsObjectFile(). From your link,

https://spark.apache.org/docs/1.0.0/api/scala/index.html#org.apache.spark.rdd.RDD

I now find what appears to be a second and more complete set of the same documentation, using a different web interface to boot. It appears that there are two sets of documentation for the same APIs, where one set is out of date and the other is not, and the out-of-date set is the one linked from the main site?

Given that our aggregate sizes will exceed memory, we expect to cache them to disk, so save-as-object (assuming there are no out-of-the-ordinary performance issues) may solve the problem; I sketch below what I have in mind. I was hoping, though, to store the data in a column-oriented format, and I think that in general is not possible - Spark can *read* Parquet, but I think it cannot write Parquet as a disk-based RDD format.

> If you want to preserve the RDDs in memory between job runs, you should
> look at the Spark-JobServer [3]

Thank you. I view this with some trepidation. It took two man-days to get Spark running (and I've spent another man-day now trying to get a map/reduce to run; I'm getting there, but not there yet) - the bring-up/config experience for end users is not tested or accurately documented (although, to be clear, no better and no worse than is normal for open source; Spark is not exceptional). Having to bring up another open source project is a significant barrier to entry; it's always such a headache.
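To make sure I've understood the suggestion, here is a minimal sketch of the round trip against the Scala API documented at the second link above. The paths, the app name, and the key-count aggregation are placeholders for our real reporting aggregates, not anything from our actual code:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.SparkContext._          // pair-RDD functions (reduceByKey)
    import org.apache.spark.storage.StorageLevel

    val sc = new SparkContext(new SparkConf().setAppName("agg-cache"))

    // Build the expensive intermediate RDD once (this aggregation is a stand-in).
    val agg = sc.textFile("hdfs:///data/events")          // placeholder input path
                .map(line => (line.split(",")(0), 1L))
                .reduceByKey(_ + _)

    // Within one job/context: spill to disk rather than recompute if it won't fit in memory.
    agg.persist(StorageLevel.MEMORY_AND_DISK)

    // Across job runs: write it out as serialized objects...
    agg.saveAsObjectFile("hdfs:///cache/agg-by-key")      // placeholder output path

    // ...and in a later job, reload it instead of recomputing from the raw data.
    val reloaded = sc.objectFile[(String, Long)]("hdfs:///cache/agg-by-key")

If that round trip behaves reasonably, it covers the disk-backed case; the in-memory-across-jobs case is the one that remains open.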
The save-to-disk function you mentioned earlier will allow intermediate RDDs to go to disk, but we do in fact have a use case where in-memory persistence would be useful; it might allow us to ditch Cassandra, which would be wonderful, since it would reduce the system count by one. I have to say, having to install JobServer to achieve this one end seems an extraordinarily heavyweight solution - a whole new application, when all that is wished for is that Spark persist RDDs across jobs: so small a feature, yet one that seems to open the door to so much functionality.
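For completeness, my understanding is that the bare-bones alternative to the JobServer is simply a long-running driver that holds one SparkContext open and keeps the cached RDD alive for as long as the web site needs it; roughly along these lines (the paths, the aggregation, and countFor are invented for illustration):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.SparkContext._
    import org.apache.spark.storage.StorageLevel

    object ReportingDriver {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("reporting-driver"))

        // Build and cache the dominant intermediate RDD once, at startup.
        val agg = sc.textFile("hdfs:///data/events")        // placeholder path
                    .map(line => (line.split(",")(0), 1L))
                    .reduceByKey(_ + _)
                    .persist(StorageLevel.MEMORY_AND_DISK)

        agg.count()   // force materialisation up front

        // Stand-in for the web tier: each request re-uses the cached RDD
        // instead of scanning the full data set.
        def countFor(key: String): Long = agg.lookup(key).sum

        // Wire countFor(...) (or whatever the real queries are) up to the
        // report endpoints; the cache lives as long as this driver does.
      }
    }

As far as I can tell, that is essentially what the JobServer provides for this use case, plus a REST front-end and job management.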